Admissions committees are tasked with using selection measures with demonstrated validity evidence1 to choose which highly competitive applicants to invite to their training programs. However, this evidence must be balanced with the school’s social goals and mission. Traditional measures of academic ability, such as grade point average (GPA) and Medical College Admission Test (MCAT) scores, are used ubiquitously in U.S. medical school admissions, largely because of their strong psychometric qualities2 and ease of use. However, both GPAs and the MCAT have demonstrated lower performance scores for those traditionally underrepresented in medicine (UIM)3,4; thus, their sole use in admissions threatens to compromise the social mandate to widen access to medical education. Such tension arises when measures are strong on psychometrics but weak on diversity—a phenomenon known as the diversity–validity dilemma.5 Measurements of nonacademic competencies, on the other hand, have traditionally been fraught with more complex issues. For example, personal statements, reference letters, and other traditional methods of screening for nonacademic competencies are resource-intensive and/or lack validity evidence.6 Because psychometric evidence is necessary but insufficient to ensure the fairness and defensibility of selection processes,7 research on all selection measures must examine their defensibility in terms of both psychometrics and diversity implications.
Recently, the application of situational judgment tests (SJTs) to the assessment of nonacademic competencies has demonstrated evidence of psychometric strength8–10 while also being less likely to compromise the social mandate to widen access to medical education.11 A literature review has suggested that differences in mean scores between different demographic groups are much smaller for SJTs than for tests of academic ability,12 which is in line with the general literature illustrating smaller subgroup differences on assessments of nonacademic competencies than on assessments of academic competencies.13 The first published report on the impact of SJTs on the promotion of diversity11 for medical and dental schools compared results from the SJT and cognitive portions of the UK Clinical Aptitude Test (UKCAT). The findings indicated some mixed results for SJTs in the United Kingdom across three indices of diversity: socioeconomic status (SES), gender, and race. Compared with the cognitive portion of the test, the SJT effect sizes were reduced significantly for SES, reversed for gender, but unchanged—or, as indicated in a more recent technical report, exaggerated14—for race.
The primary aim of this study is to explore SJT effects on diversity in the U.S. context, as the differences observed in the United Kingdom may not necessarily generalize to the U.S. population. In the United Kingdom, most applicants apply directly to medical school after graduating secondary school, whereas entrance to medical school in the United States is most common after years of postsecondary undergraduate study. The two countries also differ in their demographic composition, with the United States being more demographically diverse, and, thus, the UIM composition of medical school applicants will also likely differ.15 Additionally, racial disparities are not consistent across the two countries’ cultures because their social and historical contexts differ. Although inequality has been observed in both places, there are differences in the way these inequalities manifest in their respective countries.16–18 Finally, it is important to note that the format of the academic and nonacademic tools used in medical school admissions differs across the two countries.8,9
In this study, we examined data from a single U.S. institution, whose applicants comprise approximately 20% of all U.S. medical school applicants. That is, we compared the magnitude of subgroup differences across screening tools used at the New York Medical College School of Medicine (NYMC-SOM), including both academic screening tools (MCAT and GPA); an SJT, which was used to assess nonacademic competencies before interviews; and a multiple mini-interview (MMI), which was used to assess nonacademic competencies at the interview stage. To address the primary goals of this study, the analyses addressed two questions: (1) What are the magnitudes of score differences across different demographic groups for three academic screening measures (GPA, old MCAT, and MCAT 2015), one nonacademic screening measure (SJT), and one nonacademic interview measure (MMI; analysis 1)? and (2) What are the demographic implications of including an SJT in the screening stage for the pool of applicants who are invited to interview (analysis 2)? Following previous work on the demographic differences on the UKCAT SJT,11 and in keeping with the mission and values of NYMC-SOM, we focused on four demographic variables—SES, gender, race, and ethnicity—to examine whether the magnitude of subgroup differences found in the UKCAT SJT, as compared with other screening measures, would also be found in the U.S. context (analysis 1) and whether the inclusion of an SJT, relative to an MCAT score and GPA, would change the number of UIM applicants invited to interview (analysis 2).
The medical school
We obtained institutional review board approval at NYMC-SOM, a private, MD-granting medical school with a mission guided by the values of humanism; diversity and inclusion; and a commitment to underserved and disadvantaged communities.19 These values inform the holistic admissions process used to select entering medical students. The NYMC-SOM student body represents a broad range of racial, ethnic, cultural, socioeconomic, and educational backgrounds. Half of the student population is female; African Americans and Hispanics/Latinos compose approximately 20% of the student population, and, similarly, 20% of students have a low-SES indicator.
Beginning with the 2015–2016 admissions cycle, NYMC-SOM added the Computer-based Assessment for Sampling Personal Characteristics (CASPer) into their admissions process as a requirement for all applicants, making it the first institution in the United States to do so. CASPer is a 12-section, online, predominantly video-stem-based, constructed-response test of nonacademic competencies (for details about the development and format of the test, initially piloted at a Canadian medical school and called the Computer-Based Multiple Sample Evaluation of Noncognitive Skills, see Dore and colleagues8). Blueprinting for the U.S. version of CASPer used by NYMC-SOM began with the Association of American Medical Colleges (AAMC) inter- and intrapersonal core competencies for entering medical students.20 Numerous stakeholders at NYMC-SOM provided guidance on content development for CASPer items to ensure cultural relevancy.
The test is taken by applicants on preset dates and times wherever the applicants are located and have Internet access. Unlike more common fixed-response SJTs, the CASPer SJT uses open-ended (constructed) responses requiring applicants to type a response describing not just what they would do but also why they would do it. Government-issued photo identification and webcams are used for both initial test-taker authentication and random in-test video capture for remote monitoring. For each of the 12 unique SJT sections, applicants type their responses to three open-ended probing questions within the 5-minute time limit. The full test takes approximately 75 minutes, including a 15-minute break at the midpoint.
Because of CASPer’s high interrater reliability,8 the number of raters per section has been decreased from two to one since 2011. Each rater evaluates a single section,21 such that a CASPer score comprises 12 unique, independent, and blinded impressions of an applicant’s performance. There are multiple parallel test forms, ensuring that each testing window has its own unique content, with the section order randomized for test security.22 Because applicants complete the test on different test dates across parallel test forms, applicant scores are standardized (i.e., converted to z scores) across all test takers within a single testing window. All SJT test fees are waived for applicants eligible for the AAMC Fee Assistance Program (FAP).
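The within-window standardization described above can be sketched in a few lines. This is a minimal illustration only; the window labels and raw scores below are hypothetical, and the actual CASPer scoring pipeline is not public:

```python
from statistics import mean, pstdev

def standardize_within_window(scores_by_window):
    """Convert raw scores to z scores separately within each testing
    window, so that applicants are compared only against peers who
    took the same parallel test form."""
    z_by_window = {}
    for window, scores in scores_by_window.items():
        mu, sigma = mean(scores), pstdev(scores)
        z_by_window[window] = [(s - mu) / sigma for s in scores]
    return z_by_window

# Hypothetical raw scores for two testing windows with different forms.
raw = {"window_A": [5.0, 6.0, 7.0], "window_B": [3.0, 4.0, 5.0]}
z = standardize_within_window(raw)
# After standardization, each window is centered at 0 with unit
# variance, so the top scorer in window A and the top scorer in
# window B receive the same z score despite different raw scales.
```

Standardizing within (rather than across) windows removes any difficulty differences between parallel forms from the comparison, which is the rationale the text gives for converting to z scores per testing window.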
Basic psychometric findings for CASPer have been published previously.8 In separate CASPer administrations, overall test reliability (generalizability) ranged from G = 0.72 to G = 0.83, and interrater reliability ranged from R = 0.81 to R = 0.95.8 Scores are generated by diverse human raters, who are blinded to test takers’ personal identifiable information and have been given instructions to ignore minor spelling and grammar mistakes. CASPer has also shown disattenuated correlations with the MMI (R = 0.60)8 and predictive validity for future performance (up to seven years later) on national licensure subsections that test personal competencies, including cultural competency (R = 0.30–0.50).23 Finally, CASPer also correlates with the positions in which general surgery residency programs rank applicants on their rank lists.24
As mentioned above, beginning with the 2015–2016 admissions cycle (i.e., the cycle for the class of 2020), CASPer was mandatory for all applicants completing their secondary applications to NYMC-SOM. The program received 9,977 secondary applications for this cycle (approximately 20% of the national pool), of which 9,174 had taken CASPer. In this cycle, NYMC-SOM accepted scores from both the pre-2015 version of the MCAT (i.e., the old MCAT) and the 2015 version of the MCAT (i.e., MCAT 2015), allowing concurrent comparison of demographic differences between the two different MCAT versions as well as across selection measures (see below). A detailed breakdown of demographic characteristics based on MCAT version is provided in Table 1.
Analysis 1: Magnitude of subgroup differences across screening and interview measures.
We calculated diversity effect sizes (Cohen’s d) separately for five measures—GPA, the old MCAT, MCAT 2015, MMI, and CASPer—using bivariate analyses. We considered GPA, the old MCAT, and MCAT 2015 as assessments of academic/cognitive competency constructs and MMI and CASPer as assessments of nonacademic competency constructs. Only a subset of the eligible applicants (1,657 of 9,096, or 18.2%) were invited to participate in the MMI after screening measures and a holistic file review; thus, we did not necessarily expect the MMI samples to be representative of the overall applicant pool.
Based on the NYMC-SOM’s commitment to underserved communities (as described in its stated Policy on Diversity25) and definition of UIM, as well as using the same analyses of Lievens and colleagues,11 we considered four demographic categories—SES, gender, race, and ethnicity—which were represented by eight American Medical College Application Service (AMCAS) demographic variables/indicators (see Table 1). The reference group for the analysis was defined as the “traditional” applicant (i.e., male, white, non-Hispanic, and with no economic disadvantage). For low SES, we used three AMCAS indicators: Parental Education and Occupation Indicator Group 1 (EO-1; neither parent completed a bachelor’s degree), self-declared disadvantaged, and FAP-designated (total family income less than or equal to 300% of the national poverty level for a family of that size). Information on an applicant’s gender was extracted from AMCAS, while race and ethnicity were represented by the bivariate relationships of African American to white and Hispanic/Latino to non-Hispanic/non-Latino, respectively.
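As a concrete illustration of the effect size used in analysis 1, the following is a minimal sketch of Cohen's d with the pooled standard deviation. The score vectors are hypothetical; the study's actual analyses were run in SPSS:

```python
from statistics import mean, variance

def cohens_d(group1, group2):
    """Cohen's d: the standardized mean difference between two
    groups, using the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Pooled variance weights each group's sample variance by its
    # degrees of freedom (n - 1).
    pooled_var = ((n1 - 1) * variance(group1) +
                  (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / pooled_var ** 0.5

# Hypothetical reference-group vs. comparison-group scores:
# means differ by 1 and the pooled SD is 2, so d = 0.5.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
```

By convention, values of d around 0.2, 0.5, and 0.8 are read as small, medium, and large subgroup differences, which is the scale on which measures like the MCAT and CASPer are being compared here.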
Analysis 2: Simulations of the composition of invited applicants.
As mentioned above, CASPer was first used at NYMC-SOM in the 2015–2016 admissions cycle. During this inaugural year, it was not used for decision making at any point in the admissions process, as it was adopted to pilot how the tool could be used in subsequent years. In the subsequent 2016–2017 admissions cycle, it was used as part of the holistic file review before interviews, but with no standardized process for how the scores from different measures should be weighed. Together with other confounding variables (e.g., differences in the composition of the initial pool of applicants), it is unclear from the 2016–2017 admissions cycle data how a systematic inclusion of CASPer into the admissions process would have impacted the number of UIM students invited to interview. Therefore, we conducted a series of “what-if” simulations to estimate the number of UIM applicants from the 2015–2016 admissions cycle who would have been invited to interview with different weightings of GPA, MCAT, and CASPer scores. In these simulations, applicants were ranked based on the following composites:
- Cognitive scores only (1/2 MCAT, 1/2 GPA),
- Equal distribution among the three scores (1/3 MCAT, 1/3 GPA, 1/3 CASPer),
- Equal distribution between the cognitive and nonacademic competency scores (1/4 MCAT, 1/4 GPA, 1/2 CASPer), and
- CASPer score only.
As the old MCAT will be phased out and the new MCAT (MCAT 2015) will be used with future cohorts, we conducted these simulations only with applicants who submitted MCAT 2015 scores. All data were analyzed using SPSS version 24 (IBM, Armonk, New York), and effect sizes were calculated using the SPSS syntax by Smithson.26
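The what-if composites above lend themselves to a simple ranking sketch: standardize each measure, weight, sum, and invite the top-ranked applicants. The applicant records, weights, and invite count below are hypothetical; the study's simulations were run on the actual applicant data in SPSS:

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize a list of raw scores to z scores."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def rank_by_composite(applicants, weights, n_invite):
    """Rank applicants on a weighted composite of standardized
    measures and return the IDs of the top n_invite."""
    # Standardize each weighted measure across the applicant pool.
    z = {m: zscore([a[m] for a in applicants]) for m in weights}
    composites = [
        (sum(w * z[m][i] for m, w in weights.items()), a["id"])
        for i, a in enumerate(applicants)
    ]
    composites.sort(reverse=True)
    return [app_id for _, app_id in composites[:n_invite]]

# Hypothetical applicants with MCAT, GPA, and SJT (CASPer) scores.
applicants = [
    {"id": "a1", "mcat": 520, "gpa": 3.9, "casper": 0.0},
    {"id": "a2", "mcat": 500, "gpa": 3.5, "casper": 2.0},
    {"id": "a3", "mcat": 510, "gpa": 3.7, "casper": 1.0},
]
# A cognitive-only weighting (1/2 MCAT, 1/2 GPA) ranks a1 first,
# whereas a CASPer-only ranking puts a2 on top.
top_cognitive = rank_by_composite(applicants, {"mcat": 0.5, "gpa": 0.5}, 1)
top_casper = rank_by_composite(applicants, {"casper": 1.0}, 1)
```

The point of the sketch is that changing the weights changes who clears the invitation cutoff, which is exactly what the simulations quantify for the UIM composition of the interview pool.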
Here we first present the results of our bivariate analyses by gender, race, ethnicity, and SES (analysis 1) and then follow with the results of the what-if simulations (analysis 2).
Analysis 1: Magnitude of subgroup differences across screening and interview measures
The effect sizes for each selection tool are detailed in Figure 1 (gender, race, and ethnicity) and Figure 2 (SES indicators). Of the 9,174 applicants who completed CASPer, 3 self-identified as a gender other than male or female, and 75 had taken both the old MCAT and MCAT 2015. The former group was excluded because of the small sample size, and the latter was excluded for clarity of comparison between the two MCAT versions, leaving data from 9,096 applicants for analysis.
In line with previous findings,4 females scored lower than males on both the old MCAT and MCAT 2015, but no significant difference in GPA was observed. Gender differences were reversed on the measures of nonacademic competencies, with females outperforming males on both CASPer and the MMI (Figure 1). These results align with findings from the UKCAT SJT11 and the broader work on gender differences in nonacademic competencies.27,28
White applicants scored higher than African American applicants on all three academic metrics. Similar to the UKCAT SJT,11 white applicants also scored higher on CASPer than African American applicants, though this difference was significantly smaller than what was observed with the academic metrics. There was no significant difference between African Americans and whites on MMI performance (Figure 1).
Similarly, non-Hispanic/non-Latino applicants scored higher than Hispanic/Latino applicants on all academic metrics. Additionally, non-Hispanic/non-Latino applicants scored higher on CASPer than Hispanic/Latino applicants, but once again, this difference was significantly smaller than what was observed on the academic metrics. Non-Hispanic/non-Latino applicants received higher scores than Hispanic/Latino applicants on the MMI as well; however, this difference was significantly smaller than what was observed on the academic metrics and was similar in size to the difference that was found for Hispanic/Latino applicants on CASPer (Figure 1).
There were significant differences across all cognitive metrics on all three SES indicators: EO-1, self-declared disadvantaged, and FAP-designated (Figure 2). CASPer also showed significant differences across all SES indicators, but the effect sizes on two of the indicators (disadvantaged status and EO-1) were significantly smaller than those on the old MCAT, and differences on all three indicators were significantly smaller than those on MCAT 2015. EO-1 applicants performed worse on the MMI than did non–EO-1 applicants.
Analysis 2: Simulations of the composition of invited applicants
The results of the simulations are displayed in Table 2. Overall, the percentage of applicants invited to interview in the 2015–2016 admissions cycle was 18.2%, which, for the purpose of these simulations, would be 733 of the 4,025 students who had MCAT 2015 scores. As the weighting of CASPer increased, the number of females, African Americans, and Hispanics/Latinos increased. Self-declared disadvantaged and EO-1 applicants initially decreased when CASPer was included, though their numbers increased when only CASPer scores were used. There were too few FAP-designated applicants in any of the simulations to draw conclusions. These simulations suggest that the inclusion of CASPer in the screening stage (i.e., before interviews) may increase the number of female, racial, and ethnic minority applicants, while the results are mixed for low-SES applicants, depending on how heavily CASPer is weighted.
Among NYMC-SOM applicants from the 2015–2016 admissions cycle, traditional academic/cognitive assessments used in medical school selection (GPA, old MCAT, MCAT 2015) showed marked, significant differences across SES, gender, race, and ethnicity variables/indicators. The differences in performance between UIM and “traditional” applicants tended to be smaller or reversed on assessments of nonacademic competency (CASPer, MMI) compared with academic assessments. For example, when comparing low-SES applicants with those reporting no economic disadvantage, African Americans with whites, and Hispanics/Latinos with non-Hispanics/non-Latinos, CASPer demonstrated smaller but still evident subgroup effect sizes, whereas the MMI showed no significant subgroup differences for African Americans. However, this beneficial result has not been found with MMI use at other schools,29,30 so caution must be taken in extrapolating these MMI results.
The results of the simulations suggest that the inclusion of CASPer in the screening stage may help increase the number of female, racial, and ethnic minority applicants, while its impact on low-SES applicants was mixed. Anything that can help increase diversity among the pool of applicants who are invited to interview is particularly important, given other research findings that suggest there is a limited ability to enhance diversity once the interview stage has been reached.31 Regardless, any enhancement of diversity accrued through the implementation of SJTs should not in any way deflect from continuing efforts to promote diversity at all stages of the pipeline.32
Although not our primary aim, this study is also the first to present effect size differences between the older and newer version of the MCAT across demographic groups. During the creation of the new MCAT, “many stakeholders recognized the difficulty of using a standardized test to increase diversity in medical school admissions, but, even so, they hoped that the new exam would support efforts to increase racial, ethnic, socioeconomic, and other types of diversity.”33 Our findings mirror those of the AAMC’s annual report from 2019,34 which showed that score differences still persist across race, ethnicity, and SES. These results suggest that it may be particularly challenging to promote diversity in admissions through standardized academic/cognitive assessments.
There are limitations to this study. First, the findings are restricted to applicants to a private, Northeast suburban medical school, whose specific mission on diversity and inclusion focuses on two highly UIM groups, African Americans and Hispanics/Latinos. The results of this study may, therefore, not generalize to other UIM groups or schools, such as rural applicants who apply to state-funded medical schools, where providing medical care to rural communities is paramount to the mission. Second, compared with GPA and MCAT, the smaller effect sizes for CASPer, the SJT used in this study, should not be assumed to be indicative of all SJTs. CASPer uses some test format aspects that would tend to decrease effect sizes, such as the use of predominantly video-stem-based questions, behavioral (what would you do) rather than knowledge (what should you do) questions, general situation rather than task-specific situation scenarios, and constructed (i.e., free-text) rather than selected (e.g., multiple-choice) responses.35,36
Although caution should be exercised in deciding which SJT to use, the results of this study illustrate that including an SJT in the admissions process has the potential to widen access to medical education for a number of underrepresented demographic groups. In doing so, an SJT may help address the diversity–validity dilemma by increasing the diversity of future physicians, thereby helping to meet the needs of a changing patient population.
The authors wish to thank the following individuals: Mr. James Demaio for his invaluable data management and commitment to this project, and Dr. Winfred Arthur Jr, who demonstrated the remarkable ability to pack multiple piercing insights into a one-hour conversation.
1. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50:1–73.
2. Julian ER. Validity of the Medical College Admission Test for predicting medical school performance. Acad Med. 2005;80:910–917.
3. Heller CA, Rúa SH, Mazumdar M, Moon JE, Bardes C, Gotto AM Jr. Diversity efforts, admissions, and national rankings: Can we align priorities? Teach Learn Med. 2014;26:304–311.
4. Ramsbottom-Lucier M, Johnson MM, Elam CL. Age and gender differences in students’ preadmission qualifications and medical school performances. Acad Med. 1995;70:236–239.
5. Pyburn KM, Ployhart RE, Kravitz DA. The diversity–validity dilemma: Overview and legal context. Personnel Psychol. 2008;61:143–151.
6. Albanese MA, Snow MH, Skochelak SE, Huggett KN, Farrell PM. Assessing personal qualities in medical school admissions. Acad Med. 2003;78:313–321.
7. Dore KL, Roberts C, Wright S. Widening perspectives: Reframing the way we research selection. Adv Health Sci Educ Theory Pract. 2017;22:565–572.
8. Dore KL, Reiter HI, Eva KW, et al. Extending the interview to all medical school candidates—Computer-Based Multiple Sample Evaluation of Noncognitive Skills (CMSENS). Acad Med. 2009;84(10 suppl):S9–S12.
9. Patterson F, Cousans F, Edwards H, Rosselli A, Nicholson S, Wright B. The predictive validity of a text-based situational judgment test in undergraduate medical and dental school admissions. Acad Med. 2017;92:1250–1253.
10. Lievens F, Buyse T, Sackett PR. The operational validity of a video-based situational judgment test for medical college admissions: Illustrating the importance of matching predictor and criterion construct domains. J Appl Psychol. 2005;90:442–452.
11. Lievens F, Patterson F, Corstjens J, Martin S, Nicholson S. Widening access in selection using situational judgement tests: Evidence from the UKCAT. Med Educ. 2016;50:624–636.
12. Patterson F, Ashworth V, Zibarras L, Coan P, Kerrin M, O’Neill P. Evaluations of situational judgement tests to assess non-academic attributes in selection. Med Educ. 2012;46:850–868.
13. Schmitt N, Keeney J, Oswald FL, et al. Prediction of 4-year college student performance using cognitive and noncognitive predictors and the impact on demographic status of admitted students. J Appl Psychol. 2009;94:1479–1497.
16. Bécares L, Nazroo J, Jackson J, Heuvelman H. Ethnic density effects on health and experienced racism among Caribbean people in the US and England: A cross-national comparison. Soc Sci Med. 2012;75:2107–2115.
17. Nazroo J, Jackson J, Karlsen S, Torres M. The black diaspora and health inequalities in the US and England: Does where you go and how you get there make a difference? Sociol Health Illn. 2007;29:811–830.
18. Smedley A, Smedley BD. Race as biology is fiction, racism as a social problem is real: Anthropological and historical perspectives on the social construction of race. Am Psychol. 2005;60:16–26.
21. Dore KL, Hanson M, Reiter HI, Blanchard M, Deeth K, Eva KW. Medical school admissions: Enhancing the reliability and validity of an autobiographical screening tool. Acad Med. 2006;81(10 suppl):S70–S73.
22. Cluskey GR Jr, Ehlen CR, Raiborn MH. Thwarting online exam cheating without proctoring. J Acad Bus Ethics. 2011;4:1–7.
23. Dore KL, Reiter HI, Kreuger S, Norman GR. CASPer, an online pre-interview screen for personal/professional characteristics: Prediction of national licensure scores. Adv Health Sci Educ Theory Pract. 2017;22:327–336.
24. Shipper ES, Mazer LM, Merrell SB, Lin DT, Lau JN, Melcher ML. Pilot evaluation of the Computer-Based Assessment for Sampling Personal Characteristics test. J Surg Res. 2017;215:211–218.
26. Smithson M. Correct confidence intervals for various regression effect sizes and parameters: The importance of noncentral distributions in computing intervals. Educ Psychol Meas. 2001;61:605–632.
27. Toussaint L, Webb JR. Gender differences in the relationship between empathy and forgiveness. J Soc Psychol. 2005;145:673–685.
28. Roter DL, Hall JA, Aoki Y. Physician gender effects in medical communication: A meta-analytic review. JAMA. 2002;288:756–764.
29. Reiter H, Eva K. Vive la différence: The freedom and inherent responsibilities when designing and implementing multiple mini-interviews. Acad Med. 2018;93:969–971.
30. Henderson MC, Kelly CJ, Griffin E, et al. Medical school applicant characteristics associated with performance in multiple mini-interviews versus traditional interviews: A multi-institutional study. Acad Med. 2018;93:1029–1034.
31. Reiter HI, Lockyer J, Ziola B, Courneya CA, Eva K; Canadian Multiple Mini-Interview Research Alliance (CaMMIRA). Should efforts in favor of medical student diversity be focused during admissions or farther upstream? Acad Med. 2012;87:443–448.
32. Thomson WA, Denk JP. Promoting diversity in the medical school pipeline: A national overview. Acad Med. 1999;74:312–314.
33. Schwartzstein RM, Rosenfeld GC, Hilborn R, Oyewole SH, Mitchell K. Redesigning the MCAT exam: Balancing multiple perspectives. Acad Med. 2013;88:560–567.
35. Lievens F, Sackett PR. The effects of predictor method factors on selection outcomes: A modular approach to personnel selection procedures. J Appl Psychol. 2017;102:43–66.
36. Edwards BD, Arthur W Jr. An examination of factors contributing to a reduction in subgroup differences on a constructed-response paper-and-pencil test of scholastic achievement. J Appl Psychol. 2007;92:794–801.
© 2019 by the Association of American Medical Colleges