A constant struggle all undergraduate medical education programs grapple with is selecting learners who will adapt well to medical studies and mature to achieve competency. While many methods exist to identify such learners, our search of the literature revealed that none are cited as being able to predict student outcomes. This lack of reliable predictive tools makes the science of medical student selection elusive. While programs and educators are committed to supporting all learners, a method that would reliably allow for early intervention could help programs refocus their resources and improve individual student outcomes. Thus, the integration of mathematical models of prediction for decision making in medical student selection may allow for more calculated and informed decisions at admissions and during medical school. This paper, therefore, explores the use of multilevel modeling to predict learners’ risk of failure on the Canadian national licensing examination—the Medical Council of Canada Qualifying Examination Part 1 (MCCQE1).
Considerable research has been conducted to predict licensing examination scores.1–15 These studies have focused predominantly on the predictive validity that prematriculation variables, such as Medical College Admission Test (MCAT) scores and undergraduate grade point average (GPA), have on future academic performance.1–4,16 However, it has been well documented that the predictive power of both MCAT scores and undergraduate GPAs decreases as students progress toward graduation and learning shifts from cognitive to more clinical measures.5 As a result, the reliability of these variables when used to predict the future performance of medical students is unclear, as many licensing examinations do not occur until closer to graduation.5,6
In Canada, for example, Eva et al7 examined the association between admissions variables (multiple-mini interview scores, autobiographical essay scores, and undergraduate GPAs) and performance on national licensing examinations. This study compared the performance of matriculated students with that of applicants who were rejected but gained entry to medical school elsewhere in 2004 and 2005. Evidence from their analysis suggests that matriculated students had higher scores on national licensing examinations compared with those who were rejected.
In 2013, Woloschuk et al8 examined whether clinical performance observed in clerkships and during the first year of residency could predict pass/fail performance on the Medical Council of Canada Qualifying Examination Part 2 (MCCQE2) across four cohorts. They found that clerkship evaluations and year 1 residency ratings were significant but poor predictors of pass/fail performance on the MCCQE2. Similarly, in 2016, Pugh et al9 examined the association between objective structured clinical examinations (OSCEs) and national high-stakes examinations. Using data from eight cohorts, this study compared scores from an internal medicine residency OSCE progress test versus scores from the Royal College of Physicians and Surgeons of Canada Comprehensive Objective Examination in Internal Medicine. Results from their correlation and logistic regression analysis suggest that OSCE progress test scores and national high-stakes examinations of clinical competency were associated. These findings suggest that OSCE progress tests could be used to identify residents at risk of failing a future national high-stakes examination.
Since 2010, several studies conducted in the United States have suggested that medical student performance variables, such as course and assessment scores, are actually stronger predictors of licensing examination performance than prematriculation data.10–12 These studies suggest that year 2 student performance outcome variables are the best predictors of future academic risk on initial licensing examinations.
In 2015, Gullo et al10 found that MCAT scores combined with prematriculation math and science GPAs were strong predictors of United States Medical Licensing Examination (USMLE) Step 1 performance. However, when additional internal course-related assessment outcomes were added, the overall predictive ability of their model improved significantly. Similarly, Glaros et al11 were able to successfully predict initial licensing examination performance on the Comprehensive Osteopathic Medical Licensing Examination Level 1 using performance data obtained from the first year of medical school over three cohorts. Additionally, Coumarbatch et al12 used binary logistic regression models and receiver operating characteristic (ROC) curves to identify students at risk of failing the USMLE Step 1; their results showed that curricular year 2 cumulative averages and the MCAT biological sciences score were both significant predictors in identifying students at risk of failing initial licensing examinations.
The studies outlined above10–12 use similar methodologies to address the same important purpose—to identify students at risk. This paper builds on this robust work through the use of data from five cohorts and predictive models to identify student risk in advance of an initial licensing examination. Additionally, although there are parallels between the USMLEs and MCCQEs (both of which have been shown to be predictive of the quality of care provided in future practice17,18) in terms of basic science and medical expert content, these examinations are largely country specific. This makes results from studies conducted within the United States less applicable in predicting future student underperformance on Canadian national licensing examinations.
The purpose of our study was to address the following research questions using an analytic approach: Which admissions variables and curricular outcomes are predictive of being at risk of failing the MCCQE1? How quickly can student risk of failure be predicted? And to what extent is predictive modeling possible and accurate in estimating future student risk?
In Canada, all medical students take the MCCQE1 to receive an educational license for postgraduate training. This examination is typically written during the spring immediately following the successful completion of an MD program. While the undergraduate medical education curriculum objectives across Canada are similar, each school is autonomous in its selection of medical students, pedagogical approach, and assessment strategies. However, all Canadian medical schools adhere to rigorous accreditation standards set forth by the Association of Faculties of Medicine of Canada, and the formal accreditation process ensures that the educational requirements of all medical programs are comparable in quality, content, and preparing students for residency and professional practice.19
This study was situated at the Schulich School of Medicine & Dentistry, Western University, a midsized medical school in Canada that matriculates about 171 new students each year. Each student cohort is divided between two campuses: the main campus (London, Ontario) and the distributed campus (Windsor, Ontario). Although geographically separated, these campuses have comparable education offerings, equivalent assessments, and an identical curriculum.
The undergraduate curriculum at Schulich School of Medicine & Dentistry is a four-year, patient-centered, integrated curriculum composed of large-lecture, small-group, laboratory, and supervised clinical experiences. This pedagogical approach combines individual, problem-based small-group, active, and direct classroom learning in the preclerkship years (years 1 and 2). Year 3 consists of a single course—a yearlong integrated clerkship experience—while in the final year of the program (year 4), students participate in both clinical elective learning and a capstone transition course, which serves to enable students to build on their clinical experiences and prepare for residency.
Data and analysis
We retrospectively collected 20 years of overlapping data from five cohorts of graduating students (2011–2015), with each cohort representing four years of data. We analyzed admissions variables and curricular outcomes data using hierarchical linear modeling (HLM) and sensitivity and specificity analysis. We used HLM7 (Scientific Software International, Inc., Skokie, Illinois) to develop our predictive models and IBM SPSS software, version 23 (IBM Corp., Armonk, New York) to produce the area under the ROC curve (AUC) to evaluate the models' accuracy. To determine whether the models could predict future risk of failure, we applied them to data from the 2016 graduating student cohort, which was collected solely to test the accuracy of our predictive models.
Hierarchical linear modeling
To account for the intrinsic hierarchical nature of the data (i.e., students were nested within cohorts), we used a two-level HLM to analyze performance outcomes on the MCCQE1. HLM is a multivariate statistical technique developed in the early 1980s20–22 that has been used across multiple fields, from economics to sociology and developmental psychology.
Hierarchical generalized linear models (HGLMs) are extensions of HLM applied when data are non-normally distributed or outcomes are binary. Because this study examines a binary outcome (whether a student is or is not at risk of failing the MCCQE1), we used HGLMs to produce our predictive models.
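To make the two-level structure concrete, the following Python sketch simulates data from a Bernoulli HGLM of the kind described above: each cohort receives a random intercept, and each student's binary outcome depends on cohort- and student-level terms through a logit link. All coefficients and sample sizes here are hypothetical illustrations, not estimates from this study (which fit its models in HLM7).

```python
import math
import random

# Illustrative sketch (not the authors' HLM7 analysis): simulate the two-level
# structure of an HGLM with a Bernoulli outcome. Cohorts contribute a random
# intercept u_0j; students contribute centered predictors with fixed effects.
random.seed(42)

GAMMA_00 = 1.5         # hypothetical overall intercept (log-odds of "not at risk")
GAMMA_K0 = [0.8, 0.5]  # hypothetical coefficients for two centered predictors
COHORT_SD = 0.3        # spread of cohort-level random intercepts

def simulate_cohort(n_students):
    """Generate (probability, outcome) pairs for one simulated cohort."""
    u_0j = random.gauss(0.0, COHORT_SD)                 # cohort random intercept
    students = []
    for _ in range(n_students):
        x = [random.gauss(0.0, 1.0) for _ in GAMMA_K0]  # group-centered predictors
        eta = GAMMA_00 + u_0j + sum(g * xi for g, xi in zip(GAMMA_K0, x))
        phi = 1.0 / (1.0 + math.exp(-eta))              # inverse-logit link
        y = 1 if random.random() < phi else 0           # 1 = not at risk
        students.append((phi, y))
    return students

cohort = simulate_cohort(160)
print(len(cohort), all(0.0 < p < 1.0 for p, _ in cohort))
```

Fitting such a model recovers the fixed coefficients and the cohort-level variance from data of this shape; specialized software (such as HLM7 here) handles the estimation.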
Sensitivity and specificity analysis
In applying the predictive models produced from the HGLM analysis, individual probabilities are produced for each student. These individual probabilities can then be compared with true binary outcomes using sensitivity and specificity analysis.23,24 Sensitivity is the true positive rate: the proportion of students who were not at risk who were correctly identified as not at risk. Specificity is the true negative rate: the proportion of students who were at risk who were correctly identified as at risk.
ROC curves are used in multiple fields to evaluate the accuracy of a binary classification determined through logistic regression or similar methods.25,26 ROC curves plot sensitivity on the vertical axis against 1 − specificity on the horizontal axis; in other words, they examine the relationship between true positives and false positives. As part of this analysis, the AUC is calculated.27 An AUC value of 0.5 represents random accuracy, while a value of 1 represents perfect agreement between predicted and true outcomes; that is, the closer the AUC is to 1, the more accurate the prediction.
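As a concrete sketch, sensitivity, specificity, and the AUC can be computed directly from predicted probabilities. The AUC below uses the rank-based (Mann–Whitney) formulation, which equals the area under the plotted ROC curve. The outcome labels and probabilities are hypothetical, not data from this study.

```python
# Hedged sketch: sensitivity, specificity, and AUC for a binary classifier.
# Labels: 1 = not at risk, 0 = at risk; values below are hypothetical.
y_true = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.9, 0.8, 0.7, 0.6, 0.85, 0.2, 0.75, 0.95, 0.4, 0.3]

def sens_spec(y_true, y_prob, threshold=0.5):
    """True positive rate and true negative rate at a given cutoff."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    tn = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p < threshold)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, y_prob):
    """P(score of a random positive > score of a random negative); ties = 0.5."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

sensitivity, specificity = sens_spec(y_true, y_prob)
print(round(sensitivity, 3), round(specificity, 3), round(auc(y_true, y_prob), 3))
# → 0.857 0.667 0.905
```

An AUC of 0.905 on these toy data would indicate strong discrimination, comparable to the strongest cohort-level results reported below.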
Variables and analysis
Predictive variables included the following measures: gender, location of high school education (rural vs. urban), undergraduate GPA, MCAT scores (verbal reasoning, and physical and biological sciences), admissions interview scores, campus location (London vs. Windsor), and curricular performance outcomes (years 1 and 2 course mean grades, years 1 and 2 cumulative averages, and year 4 summative OSCE score). Course mean grades are based on students’ overall performance within each course.
Age at matriculation was not included as a potential predictor in our analysis because there was minimal variation in age at matriculation across the cohorts. The average age of our incoming cohorts was 23, and there was no correlation between age at matriculation and our dependent variable. However, we ran separate analyses that included students' age at matriculation, age at graduation, and program duration (in years); these variables had no effect on student risk of failure on the MCCQE1, did not change our coefficient estimates (see below), and were not significant predictors of being at risk of failing the MCCQE1.
Of a total of 21 preclerkship courses,28 each emphasizing a specific physical system, we included three courses from years 1 and 2 in our analysis. These courses were identified on the basis of consultation with the undergraduate dean of curriculum as being relatively stable in content and difficulty over the 2011–2016 graduating cohorts.
The dependent variable is a dichotomized variable measuring student risk of failure on the MCCQE1, using a cutoff score of 450. Overall, the MCCQE1 has a national mean score of 500, standard deviation of 100, and pass score of 427. To err on the side of caution in our probability estimates, we assigned a conservative cutoff score of 450 to capture students who fell more than half of a standard deviation below the mean.
The MCCQE1 uses a scaled score based on a two-part written examination. Prior to 2015, the first part was equated from year to year and the second part was reestimated every year. Since 2015, the full examination is equated from year to year. There was also a change in the minimum score needed to pass the examination, from a previous cutoff score of 390 on the old 50–950 scale to 427 (which would have been 440 on the old scale), in 2015.
Aside from its use as a qualifying examination, the MCCQE1 also serves as a national standard for undergraduate medical programs in Canada and allows student performance to be compared across institutions.29 This examination is, therefore, a high-stakes, summative computer-based assessment, measuring both medical knowledge and clinical decision making.29
We developed predictive models through an iterative, stepwise process.30 First, we examined student characteristic variables such as gender, location of high school education (rural vs. urban), and campus location (London vs. Windsor). After the predictive characteristic variables were determined, we added admissions variables, such as undergraduate GPA, interview scores, and MCAT scores. Lastly, we added curricular outcomes, such as years 1 and 2 course mean grades and cumulative averages and year 4 summative OSCE score, one year at a time. Because our predictive variables included both standardized and unstandardized measures, we used group-mean centering, which allowed us to compare group mean differences across cohorts for each variable. We removed observations with missing data using listwise deletion at the analysis stage. An example of a general HGLM model used in this study is provided in Appendix 1.
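Group-mean centering of the kind described above can be sketched as follows: each student's value is re-expressed as a deviation from their cohort's mean, so that differently scaled variables can be compared across cohorts. The cohort labels and grades below are hypothetical.

```python
# Hedged sketch of group-mean (within-cohort) centering.
# (cohort, course grade) pairs; values are hypothetical, not study data.
records = [
    ("2013", 82.0), ("2013", 78.0), ("2013", 80.0),
    ("2014", 90.0), ("2014", 86.0),
]

def group_mean_center(records):
    """Return each record with its score replaced by (score - cohort mean)."""
    # First pass: accumulate per-cohort sums and counts.
    totals = {}
    for cohort, score in records:
        s, n = totals.get(cohort, (0.0, 0))
        totals[cohort] = (s + score, n + 1)
    means = {c: s / n for c, (s, n) in totals.items()}
    # Second pass: deviation of each student from the cohort mean.
    return [(cohort, score - means[cohort]) for cohort, score in records]

print(group_mean_center(records))
# → [('2013', 2.0), ('2013', -2.0), ('2013', 0.0), ('2014', 2.0), ('2014', -2.0)]
```

After centering, a coefficient on a predictor is interpreted relative to a cohort's own average rather than an absolute scale, which is what allows the cohort-to-cohort comparisons described above.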
Next, individual predictive models were created using variables identified within our model to estimate the coefficients for each set of variables and assess student risk of failure at admissions, year 1, year 2, and pre-MCCQE1 (or five months prior to the MCCQE1). Lastly, these models were applied separately to each cohort (2011–2015) using AUCs to evaluate their accuracy in predicting students at or not at risk of failing the MCCQE1. These models were also applied to the 2016 cohort to determine whether they could be used to predict future risk.
This study was reviewed by the Health Science Research Ethics Board at Western University and was determined to be exempt.
In total, 20 years of overlapping data were gathered across five cohorts of graduating students (4 years of data from each cohort) from 2011 to 2015, comprising 788 students. The number of students in each cohort ranged from 147 to 168, with an average number of 157 students per cohort. Table 1 provides the mean, standard deviation, and range of possible scores for the predictive variables used in the development of the HGLM analysis.
As shown in Table 1, the number of students per variable varies, as do the ranges of possible scores and means for each variable. However, the standard deviations within each group of variables are very similar. Additionally, 389 (49.4%) of all students were male and 348 (44.2%) were female.
On the basis of results from our HGLM analyses, we were able to identify the following five variables as significant predictors of being at risk of failure: gender, MCAT verbal reasoning score, two preclerkship course mean grades (year 1 course C and year 2 course A), and the year 4 summative OSCE score (Table 2).
These results showed that, on average, females have higher odds of scoring ≥ 450 on the MCCQE1 (i.e., of not being at risk of failure) than males, when controlling for all other variables. This finding may suggest that the gender performance gap on high-stakes medical examinations is narrowing. However, future research is needed to examine this gender-related performance gap further. Additionally, students with higher (relative to their cohort’s average) MCAT verbal reasoning scores, year 1 course C and year 2 course A mean grades, and year 4 summative OSCE scores have higher odds of not being at risk of failure.
After developing the predictive models (at admissions, year 1, year 2, and pre-MCCQE1), we examined how accurate we were in predicting student risk of failure by inputting data from each cohort separately. We then examined the accuracy of the models in predicting future risk using the 2016 cohort data. Table 3 provides the model accuracy results calculated from the AUC.
The predictive accuracy (AUC) of the models varies. Overall, the pre-MCCQE1 model is the most accurate at predicting a student’s risk of failing (AUC 0.66–0.93), while the admissions model is not an accurate predictor of being at risk of failing the MCCQE1 (AUC 0.25–0.47). The accuracy of the year 1, year 2, and pre-MCCQE1 models varies from year to year, with high levels of accuracy for the 2013, 2014, and 2015 cohorts. With the 2014 cohort, for example, we had an AUC of 0.90 (95% confidence interval 0.84, 0.97) in our year 2 model, demonstrating a strong ability to predict students being at risk. Overall, the models were less predictive when it came to predicting future performance with the 2016 cohort, for which we were only able to achieve a significant AUC with the pre-MCCQE1 model. This suggests that the models need to be evaluated and updated from time to time to ensure that they are accurately predicting students at risk.
This paper offers both an approach and evidence of the possibility and accuracy of predictive modeling in undergraduate medical education. Using 20 years of data across five cohorts (4 years of data from each cohort), we developed four predictive models and measured their accuracy in identifying student risk of failing a national licensing examination at admissions, year 1, year 2, and pre-MCCQE1. Outcomes from our HGLM analysis identified five key admissions variables and curricular outcomes that are predictive of students at risk of failing the MCCQE1, a national licensing examination. Similar to findings from previous studies, evidence from our models suggests that, while it is not possible to identify student risk during the admissions process, we can begin to identify and monitor students at risk of failure by the end of year 1 studies.10–12 However, these predictions must be further validated in year 2 and again pre-MCCQE1.
Findings from our AUC analyses suggest that the predictive accuracy of these models varied among the cohorts. However, as more variables were added to our model, we were able to better predict student risk of failure with increasing levels of accuracy. For the 2013, 2014, and 2015 cohorts, we were able to predict student risk of failure with high levels of accuracy. While the models were found to be less predictive in 2016, when estimating future student risk, we were still able to predict student risk of failure with some accuracy using our pre-MCCQE1 model, allowing for intervention depending on the student’s level of competency. The variation among cohorts suggests that these models may need to be evaluated from year to year to control for any curricular changes or differences within student populations.
Estimates produced from our AUCs will be used internally to create cutoff scores to classify students into risk categories (low, medium, or high). Student risk will then be assessed at multiple stages using our year 1, year 2, and pre-MCCQE1 models. Students identified as medium to high risk at any stage will be reviewed on a case-by-case basis (with other supporting documentation regarding student performance taken into consideration). Support and intervention will then be prioritized on the basis of student need.
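A minimal sketch of this staged triage is shown below, assuming hypothetical probability cutoffs; as noted above, the actual cutoffs will be derived internally from the AUC estimates, so the thresholds here are placeholders only.

```python
# Hedged sketch: bin each student's predicted probability of scoring >= 450
# into a risk category. Cutoffs are hypothetical placeholders, not the
# internally derived values described in the text.
LOW_RISK_MIN = 0.85     # assumed: p >= 0.85 -> low risk
MEDIUM_RISK_MIN = 0.60  # assumed: 0.60 <= p < 0.85 -> medium risk

def risk_category(p_not_at_risk):
    """Map a predicted probability of 'not at risk' to a risk category."""
    if p_not_at_risk >= LOW_RISK_MIN:
        return "low"
    if p_not_at_risk >= MEDIUM_RISK_MIN:
        return "medium"
    return "high"

# Students flagged for case-by-case review (medium or high risk).
flagged = [p for p in (0.95, 0.72, 0.40) if risk_category(p) != "low"]
print([risk_category(p) for p in (0.95, 0.72, 0.40)], flagged)
```

In practice, a flagged probability would trigger review alongside other supporting documentation rather than an automatic decision.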
From a programmatic standpoint, these results have the potential to allow educators and leaders to better support learners and minimize the risk of future academic failure through early detection, coupled with effective intervention, in advance of national licensing examinations. Programs need a reliable way to accurately identify and support learners at risk of academic failure, and this study serves as an example of how predictive modeling can be used to identify underperformance. In an era of data-driven decision making and demand for greater transparency and accountability, Canadian medical schools are socially accountable for the government funding they receive for student education. Given the relative homogeneity of learners before matriculation, leaders value tools that support effective formative decision making.
Using 20 years of data across five cohorts, we were able to confirm many findings consistent with other cross-national studies.10–12,16 This paper offers programs and educators a new approach to proactively and quantitatively identify and monitor students at potential academic risk within the first years of medical school. Even though our findings indicate that we were able to accurately estimate a risk score for student failure (via AUCs) in advance of an initial licensing examination, there are a few limitations we would like to address.
First, our models are highly sensitive to variability over time. While this may reflect both pedagogical changes and student population differences from year to year, it makes estimation more challenging. Even so, our predictions based on stable student estimates were modestly accurate. Nevertheless, our data suggest that schools adopting this approach should be aware that these models need to be updated periodically to account for any curricular changes or student population differences. Additionally, while our results show that predictive analytics can be a powerful curricular governance tool for identifying and supporting learners through intervention, the probability estimates these models produce should be used as a guide, alongside other supporting documentation from course chairs and faculty as well as program governance indicators. Finally, although this study relied heavily on institution-specific variables (which could limit the generalizability of these findings), we constructed our predictive models to include variables that all medical schools have access to and can analyze for prediction. As a result, we believe the methodology and modeling presented in this study could be effectively replicated at other universities.
Using an analytic approach to student data to identify students at risk of failure early on, we were able to systematically identify key predictive variables and offer a methodology that could be used to predict future student performance on national licensing examinations. Through the use of HGLM and AUC analyses, we were able to quantify student risk of academic failure on the MCCQE1 early in their program of study. Findings from these types of models could enable programs to better identify and monitor students at potential academic risk quantitatively and develop tailored early intervention strategies, potentially up to three years prior to this key career assessment.
Future research is required to examine how forthcoming changes made to the new MCCQE131 may affect student performance as well as our ability to predict student risk going into the examination. We also propose that gender performance differences on the MCCQE1 should be further examined. Lastly, this study should be expanded to examine student risk of failure on the MCCQE2, which is offered two years post graduation.
In conclusion, our models and results suggest that medical schools may wish to consider adding an analytic approach to student data review to better predict licensing examination outcomes using variables within their curriculum. This could lead educators to intervene early and offer tailored interventions to students identified as being at potential risk. These models may have the potential to enable programs not only to better predict future student performance but also to confidently identify, support, and improve the quality of program graduates.
Acknowledgments: The authors would like to thank Bertha Garcia, MD, FRCPC, vice dean of education and professor, Department of Pathology and Laboratory Medicine, Schulich School of Medicine & Dentistry, Western University, for her continued support and encouragement throughout this project. Additionally, the authors would like to thank Ruth Childs, PhD, professor, Department of Leadership, Higher and Adult Education, Ontario Institute for Studies in Education, University of Toronto, and Nancy McVittie, programmer/analyst, Information Services, Kathy Sadler, admissions program manager, and Matthew Longstaffe, undergraduate medical education acting program manager, all of Schulich School of Medicine & Dentistry, Western University.
1. Kreiter CD, Kreiter Y. A validity generalization perspective on the ability of undergraduate GPA and the Medical College Admission Test to predict important outcomes. Teach Learn Med. 2007;19:95–100.
2. Donnon T, Paolucci EO, Violato C. The predictive validity of the MCAT for medical school performance and medical board licensing examinations: A meta-analysis of the published research. Acad Med. 2007;82:100–106.
3. Julian ER. Validity of the Medical College Admission Test for predicting medical school performance. Acad Med. 2005;80:910–917.
4. Violato C, Donnon T. Does the medical college admission test predict clinical reasoning skills? A longitudinal study employing the Medical Council of Canada clinical reasoning examination. Acad Med. 2005;80(10 suppl):S14–S16.
5. Siu E, Reiter HI. Overview: What’s worked and what hasn’t as a guide towards predictive admissions tool development. Adv Health Sci Educ Theory Pract. 2009;14:759–775.
6. Kulatunga-Moruzi C, Norman GR. Validity of admissions measures in predicting performance outcomes: The contribution of cognitive and non-cognitive dimensions. Teach Learn Med. 2002;14:34–42.
7. Eva KW, Reiter HI, Rosenfeld J, Trinh K, Wood TJ, Norman GR. Association between a medical school admission process using the multiple mini-interview and national licensing examination scores. JAMA. 2012;308:2233–2240.
8. Woloschuk W, McLaughlin K, Wright B. Predicting performance on the Medical Council of Canada Qualifying Exam Part II. Teach Learn Med. 2013;25:237–241.
9. Pugh D, Bhanji F, Cole G, et al. Do OSCE progress test scores predict performance in a national high-stakes examination? Med Educ. 2016;50:351–358.
10. Gullo CA, McCarthy MJ, Shapiro JI, Miller BL. Predicting medical student success on licensure exams. Med Sci Educ. 2015;25:447–453.
11. Glaros AG, Hanson A, Adkison LR. Early prediction of medical student performance on initial licensing examinations. Med Sci Educ. 2014;24:291–295.
12. Coumarbatch J, Robinson L, Thomas R, Bridge PD. Strategies for identifying students at risk for USMLE Step 1 failure. Fam Med. 2010;42:105–110.
13. Schabort I, Mercuri M, Grierson LE. Predicting international medical graduate success on college certification examinations: Responding to the Thomson and Cohl judicial report on IMG selection. Can Fam Physician. 2014;60:e478–e484.
14. White CB, Dey EL, Fantone JC. Analysis of factors that predict clinical performance in medical school. Adv Health Sci Educ Theory Pract. 2009;14:455–464.
15. Dunleavy DM, Kroopnick MH, Dowd KW, Searcy CA, Zhao X. The predictive validity of the MCAT exam in relation to academic performance through medical school: A national cohort study of 2001–2004 matriculates. Acad Med. 2013;88:666–671.
16. Wiley A, Koenig JA. The validity of the Medical College Admission Test for predicting performance in the first two years of medical school. Acad Med. 1996;71(10 suppl):S83–S85.
17. Wenghofer E, Klass D, Abrahamowicz M, et al. Doctor scores on national qualifying examinations predict quality of care in future practice. Med Educ. 2009;43:1166–1173.
18. Tamblyn R, Abrahamowicz M, Dauphinee WD, et al. Association between licensure examination scores and practice in primary care. JAMA. 2002;288:3019–3026.
20. Goldstein H. Efficient statistical modelling of longitudinal data. Ann Hum Biol. 1986;13:129–141.
21. Raudenbush SW, Bryk AS. Examining correlates of diversity. J Educ Stat. 1987;12:241–269.
22. Wong GY, Mason WM. The hierarchical logistic regression model for multilevel analysis. JASA. 1985;80:513–524.
23. Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006;27:861–874.
24. Metz CE. Basic principles of ROC analysis. Semin Nucl Med. 1978;8:283–298.
25. Hanley JA. Receiver operating characteristic (ROC) methodology: The state of the art. Crit Rev Diagn Imaging. 1989;29:307–335.
26. Rotello CM, Macmillan NA, Van Tassel G. Recall-to-reject in recognition: Evidence from ROC curves. J Mem Lang. 2000;43:67–88.
27. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
30. Tabachnick BG, Fidell LS. Using Multivariate Statistics. 5th ed. Needham Heights, MA: Allyn & Bacon; 2007.
31. Cross C. Changes ahead for Medical Council of Canada qualifying exams. CMAJ. 2013;185:1565.
Appendix 1
Example of a General Hierarchical Generalized Linear Modeling Model Used in a Study on Predicting Student Failure on a National Licensing Examination, Schulich School of Medicine & Dentistry, Western University, 2011–2015 Graduating Cohorts

Level-1 (student-level) model, using a Bernoulli sampling model with a logit link:

Prob(Y_ij = 1) = φ_ij
η_ij = log[φ_ij / (1 − φ_ij)] = β_0j + β_1j·X_1ij + … + β_kj·X_kij

where
φ_ij is the probability that student i in cohort j scores a 1;
β_0j is the intercept term;
β_kj is the coefficient for the kth variable for student i in cohort j;
X_kij is the kth variable value for student i in cohort j.

Level-2 (cohort-level) model:

β_0j = γ_00 + u_0j
β_kj = γ_k0

where
u_0j is the random component of β_0j;
β_0j is an effect of the reference variable in cohort j;
β_kj is an effect of the kth variable in cohort j;
γ_00 is the overall effect of variables;
γ_k0 is the coefficient for variable k.

Using the model above, estimates for each of the variables were calculated. To calculate probability estimates of students who are or are not at risk, the following formula was used:

φ_ij = 1 / (1 + e^(−η_ij))
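As an illustrative sketch, converting a fitted log-odds value η to a probability estimate via the standard inverse-logit transform can be written as follows; the η values shown are hypothetical, not fitted estimates from this study.

```python
import math

def inverse_logit(eta):
    """Convert a log-odds value (eta) to a probability (phi) in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical fitted log-odds for three students (not study estimates):
# a clearly safe student, a borderline student, and a likely at-risk student.
for eta in (2.2, 0.4, -1.0):
    print(round(inverse_logit(eta), 3))
```

Each resulting probability would then be compared against the risk cutoffs to decide whether the student warrants review.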