To evaluate a health professions program, such as an undergraduate medical education program, efficient and meaningful ways to measure outcomes are needed. Current measures tend to be limited to individual components of a curriculum (e.g., courses, horizontal or longitudinal themes) and overlook the possibility that the “whole is greater than the sum of its parts.” Readiness for clerkship and readiness for residency are two important outcomes of the first four years of medical training that indicate whether a curriculum enables students to achieve the competencies expected midway through their undergraduate program and at the time of graduation, respectively. Although it is efficient and less costly than other strategies to ask students how much they learned, a growing body of evidence shows that self-assessment data are not trustworthy as measures of individual performance.1–3
The validity of a measurement instrument or data collection strategy, however, is not absolute: scores that can safely be deemed valid or invalid for one purpose (and with a particular population) may yield a completely different degree of reassurance when used to draw inferences in another context. For example, D’Eon et al4,5 showed that inherently flawed measures of an individual’s success (i.e., self-assessments) could be used to offer a meaningful evaluation of the success of an educational activity (i.e., workshops). Their success with this line of reasoning led us to investigate whether self-assessments held utility for large-scale program evaluation when the focus was similarly taken off the individual and placed on the aggregate, in this case using aggregated self-assessments to rank order the aspects of competence attained by a group of medical students in preclerkship training.6 Development and application of the Readiness for Clerkship Survey (RfC) indicated that, despite idiosyncrasy in individual students’ judgments of themselves, (1) aggregated student data reliably differentiated between aspects of competence attained in the first two years of undergraduate training, (2) aggregated students’ judgments of competence correlated very well with faculty data even though students tended to assign higher scores to themselves, and (3) survey items clustered into meaningful constructs that align with a competency framework.
Given this novel application of self-assessments to program evaluation and the uniqueness of the results, it is important to test their replicability in different contexts. The purpose of this study was to test the generalizability of the aggregated self-assessment approach to program evaluation by developing a Readiness for Residency Survey (RfR) to evaluate an MD program as a whole, thereby extending our previous work. We were interested in determining whether our approach provided a reliable indicator of the relative effectiveness with which competencies were achieved after completion of a full medical education curriculum. As a construct validity check, we also tested the hypothesis that residents are more qualified than new clerkship students and should thus score higher on the physician tasks (items) that are common across both surveys. The results of this study will be of interest to evaluators, education researchers, and curriculum leaders examining the overall effectiveness of medical education programs.
This study was conducted in the context of the four-year undergraduate MD program at the University of British Columbia (UBC) Faculty of Medicine. The first two years of the program are organized around a core of body system blocks that integrate clinical and basic sciences. Longitudinal experiences in those years provide an early introduction to communication and clinical skills, determinants of health, and family medicine. For the vast majority of students, clerkship consists of discipline-based clinical rotations in year 3, and electives, selectives, and a consolidation course in year 4. Others complete year 3 in a longitudinal integrated clerkship, typically in a small community. The UBC medical school, the only medical school in the province of British Columbia, has four campuses and is accredited by the Committee on Accreditation of Canadian Medical Schools and the Liaison Committee on Medical Education. The UBC behavioral research ethics board reviewed and approved the study.
To ensure that items in the residency survey represented the competencies expected of MDs upon graduation, we used the UBC graduate competencies7 to guide item development. They are based on the CanMEDS8 and CanMEDS Family Medicine9 competency frameworks. Similar to our RfC survey, we created two versions: one for residents and one for their supervisors. Resident respondents were asked to rate their own ability to perform the listed physician tasks at the beginning of residency using a behaviorally anchored scale we developed on the basis of the constructs of sophistication and independence.10 Supervisors were asked to rate their residents’ abilities on the same items using the same scale. This survey is available on request.
A pilot study was conducted in April 2012 with 57 first-year residents and 66 faculty supervisors using an initially developed 25-item survey. The data revealed very high correlations between the mean scores assigned to each item by residents and their supervisors (r = 0.93). Only 4 items differed significantly in their mean ratings between residents and faculty; faculty rated the residents higher than the residents rated themselves in each case. Generalizability theory analyses and decision studies showed that reliable data (G = 0.80) could be obtained with as few as 11 residents compared with 33 faculty supervisors. On the basis of these findings and our earlier work comparing medical students and faculty supervisors,6 we decided to focus only on resident self-assessments.
Further, on the basis of feedback obtained about the individual items, we concluded that it was necessary to deconstruct the listed physician tasks to better align the focus of the items with the medical curriculum. To this end, we reexamined the UBC graduate competencies7 and items from the previously published RfC survey6 to further develop and refine the residency items. In general, where items aligned between the two surveys, the scope of the physician task was revised in one or both surveys to reflect greater achievement expectations for residents. At the end of this iterative process, the RfR survey consisted of 42 items (Table 2). We solicited and incorporated feedback about the instrument and its administration from key constituents in postgraduate education, including the postgraduate associate dean, two postgraduate program directors, a site director in family medicine, and one first-year resident. On the final survey, the items are listed in a sequence reflecting a typical patient encounter. Resident respondents are asked to rate their own ability to do or perform the listed physician tasks at the beginning of residency using a scale based on the construct of independence: 1 = Almost always required guidance/assistance, 2 = Frequently required guidance/assistance, 3 = Sometimes required guidance/assistance, 4 = Rarely required guidance/assistance, 5 = Almost never required guidance/assistance, and NA = unable to rate/not applicable.
Participants and procedures
UBC graduates of the class of 2012 who entered a postgraduate training program in Canada or the United States in the disciplines of family medicine, internal medicine, pediatrics, psychiatry, obstetrics–gynecology, or general surgery were eligible to participate (n = 185). We administered the survey to residents in UBC postgraduate programs (n = 89) electronically via one45 (one45 Software, Inc., Vancouver, British Columbia, Canada) in March 2013, about seven months after the start of residency training. We administered the survey to residents in non-UBC postgraduate programs (n = 96) at the same time using Vovici (VERINT Systems Inc.) given that these trainees did not have access to one45. Data collection was anonymous, and the survey remained open for approximately three weeks. To increase participation rates, we offered residents the opportunity to enter a draw for an iPad mini after completion of the survey using a mechanism that prevented linkage of respondent identity and survey responses.
As a construct validity check, we also retrieved data from a study of year 3 medical students in the class of 2014 that involved the revised and aligned version of the RfC survey. Both groups (students and residents) were compared on the 29 items that were common across the surveys to test the hypothesis that residents are more qualified than new clerks and, therefore, should score higher on a rating of their competence. Eighty-seven year 3 students were eligible, and their survey was administered under conditions similar to those for the UBC residents (i.e., the questions were delivered anonymously via one45 with the independence rating scale, and the survey was open for three weeks beginning in November 2012, about four months after the students transitioned to clerkship). We also compared the responses of UBC residents with those of residents completing their postgraduate training outside of UBC as a discriminant validity check (i.e., to ensure that it was not the residents’ current training program that was driving their responses).
We extracted the variance components attributable to raters, items, and the rater × item interaction using ANOVA and generalizability analyses, and used these variances to determine the extent to which raters could reliably differentiate between the tasks described by each item. We next used decision studies to determine how many raters would be required to achieve a reliability coefficient of 0.80. For the 29 items that were common between the clerkship and residency surveys, we used ANOVA to compare the ratings assigned to each item by each group. Similarly, we used descriptive statistics, Pearson and Spearman correlation coefficients, and ANOVA to compare the ratings assigned to each item on the RfR survey by internal and external residents. Effect size (η²) was interpreted as follows: small = 0.01 to 0.05, medium = 0.06 to 0.13, and large ≥ 0.14. Statistical differences with a medium effect size (≥ 0.06) were considered to be of practical significance.
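The variance decomposition described above can be illustrated with a minimal sketch. It is not the analysis code used in this study; it assumes a fully crossed raters × items design with one rating per cell and complete data, estimates the components from the two-way ANOVA mean squares, and reports a relative G coefficient with items as the facet of differentiation. The function names and the small rating matrix are hypothetical and serve only as an illustration.

```python
def variance_components(scores):
    """Estimate variance components for a crossed raters x items design.

    scores[r][i] is the rating given by rater r to item i.
    Assumes one observation per cell and no missing data.
    """
    n_r, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_r * n_i)
    rater_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_r for i in range(n_i)]

    # Sums of squares for the two main effects and the residual
    # (the rater x item interaction is confounded with error here).
    ss_r = n_i * sum((m - grand) ** 2 for m in rater_means)
    ss_i = n_r * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_r - ss_i

    ms_r = ss_r / (n_r - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_r - 1) * (n_i - 1))

    # Expected-mean-square solutions; negative estimates are set to zero.
    return {
        "rater": max((ms_r - ms_res) / n_i, 0.0),
        "item": max((ms_i - ms_res) / n_r, 0.0),
        "residual": ms_res,
    }


def g_coefficient(vc, n_raters=1):
    """Relative G coefficient for differentiating items, averaged over n_raters."""
    return vc["item"] / (vc["item"] + vc["residual"] / n_raters)


# Hypothetical ratings: 4 raters (rows) x 3 items (columns).
example = [[4, 3, 2], [5, 3, 1], [4, 4, 2], [5, 2, 1]]
components = variance_components(example)
print(components, g_coefficient(components))
```

An absolute coefficient would also add the rater variance to the error term; whether the relative or the absolute form is appropriate depends on how the scores are to be interpreted.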
Of eligible residents who were sent the residency survey, 53% (n = 47/89) of residents in UBC programs and 35% (n = 34/96) of residents in other residency programs responded, giving an overall response rate of 44% (n = 81/185). Of eligible medical students who were sent the clerkship survey, 51% (n = 44/87) responded.
Reliability of resident data
Table 1 reveals that the variance attributable to differences across items (the subject of measurement) was larger than the variance across raters, and approximately 50% of the variance in the scores was attributable to residual error. These values are very similar to those for the RfC survey (Table 1). Converting these numbers into reliability coefficients with items as the facet of differentiation confirms that one resident’s opinion is not a reliable indication of which competencies have been achieved: the interrater reliability was low (G = 0.32), similar to the value obtained for the clerkship survey (G = 0.34). However, decision studies performed on the same data showed that reliable differentiation (G = 0.80) between items can be achieved as long as responses are averaged across a minimum of nine residents, a value similar to that calculated for the clerkship survey (i.e., eight medical students). Averaging across all raters for whom a complete data set was available yielded near-perfect reliability (G[n = 62] = 0.97 for residents who completed the RfR survey compared with G[n = 27] = 0.93 for medical students who completed the RfC survey).
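The decision-study figures above can be approximated from the single-rater coefficients alone. The sketch below is an illustration, not the authors' procedure: it assumes a one-facet design in which averaging over n raters divides the error variance by n, so the projection reduces to the Spearman-Brown formula. The function names are hypothetical, and because the inputs are the rounded coefficients reported in the text, agreement with the published values is approximate.

```python
def projected_g(g_single, n_raters):
    """Project the G coefficient obtained when averaging over n_raters raters."""
    return n_raters * g_single / (1 + (n_raters - 1) * g_single)


def raters_needed(g_single, target=0.80):
    """Smallest number of raters whose averaged ratings reach the target G."""
    if not (0 < g_single < 1 and 0 < target < 1):
        raise ValueError("g_single and target must lie strictly between 0 and 1")
    n = 1
    while projected_g(g_single, n) < target:
        n += 1
    return n


print(raters_needed(0.32))               # residency survey -> 9
print(raters_needed(0.34))               # clerkship survey -> 8
print(round(projected_g(0.32, 62), 2))   # -> 0.97
print(round(projected_g(0.34, 27), 2))   # -> 0.93
```

Under these assumptions the projection reproduces the nine residents, eight students, and the full-sample coefficients of 0.97 and 0.93 reported above.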
Comparison between resident and clerk responses
Table 2 illustrates the mean scores assigned to each item in sequence from highest to lowest as rated by the residents. Also presented in Table 2 are the corresponding data from the RfC survey. Physician tasks on which residents rated themselves highest related to the competencies of professional, communicator, collaborator, and medical expert. In contrast, no medical expert competencies achieved an average rating of 4.0 or were ranked among the top 10 highest rated by medical students. Areas rated lowest by residents were those related to diagnosis, management, and communicating bad or difficult news to a patient. Because the number and content of the survey items are not the same in the residency and clerkship surveys, it is not possible to compare every item or the total scale sum. For the 29 items that were common to both surveys, resident ratings were higher than student ratings on 25/29 items (P < .001 using the binomial probability formula). Nine of these differences were statistically significant with a moderate or larger effect size (P < .05 and η² ≥ 0.06). Eight out of those 9 items were competencies that are in the medical expert domain of the CanMEDS framework (e.g., take a full medical history, perform a full physical exam, formulate a problem list, interpret key laboratory findings, and interpret relevant imaging reports). Of the 4 items where the resident rating was numerically lower than the student rating, the numeric difference did not achieve the cutoff for practical significance.
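The binomial probability for the 25/29 comparison can be checked with a short calculation. This sketch assumes a one-sided test against a chance probability of 0.5 that either group rates higher on any given item and treats the items as independent; the text does not state whether a one-sided or two-sided test was used, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """Exact probability of observing at least `successes` out of `trials`."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


p_value = binomial_upper_tail(25, 29)
print(p_value, p_value < 0.001)
```

Doubling the tail probability gives a two-sided value, which also remains below .001 in this case.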
Comparison of ratings provided by residents in UBC and other postgraduate programs
Given that the focus of the RfR survey is intended to be on the training residents received in their undergraduate medical program, there is little reason to believe that the rank ordering of competencies should differ for those who remained at UBC for their postgraduate training relative to those who went elsewhere. As a further construct validity test, therefore, we correlated the item means assigned by these two groups and found the correlation to be very high (r = 0.97). Nine of the 10 highest- and lowest-rated competencies were the same for both resident groups. Residents in external programs rated themselves numerically higher than internal residents on 38/42 items, which is higher than would be expected by chance alone (P < .001 using the binomial probability formula). However, only 7 of those 38 differences achieved statistical significance, and only 4 of these achieved a medium practical significance threshold (Table 3).
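The comparison of item means between the two resident groups can be sketched as follows. This is an illustration only: the item means are hypothetical placeholders rather than study data, the function names are assumptions, and the Spearman coefficient is computed as the Pearson correlation of average ranks so that tied item means are handled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation (Pearson correlation of average ranks)."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical item means for internal and external residents.
internal = [4.5, 3.9, 3.2, 2.8]
external = [4.6, 4.1, 3.0, 3.1]
print(pearson(internal, external), spearman(internal, external))
```

The Pearson coefficient reflects agreement in the item means themselves, whereas the Spearman coefficient reflects agreement in their rank ordering, which is the property of interest when the goal is to identify relative strengths and weaknesses.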
The RfR survey was designed to measure the overall effectiveness of a medical undergraduate program in enabling students to develop the required competencies appropriate for entry into the first year of postgraduate training. To our knowledge, there are no other such instruments reported in the literature, although aggregated self-assessments are beginning to be recognized as having utility for the evaluation of educational programs.11 The results of this study show that aggregated resident self-assessments reliably differentiate between aspects of competence attained over four years of undergraduate training. Our data revealed that a relatively small sample of residents can be randomly selected to complete the survey without compromising this reliability as G = 0.80 can be obtained with as few as nine residents. This finding replicates and demonstrates the generalizability of research conducted on the RfC.6 It suggests that the RfR survey is a valid and reliable instrument for evaluating overall program effectiveness, including areas of strength and weakness of a program in preparing students for residency training.
A number of important practical implications arise from these results. First, the evidence presented here indicates that residents generally assigned themselves higher ratings than year 3 medical students, thereby suggesting that repeated measures could be used to evaluate the effectiveness of the curriculum longitudinally by assessing whether or not medical students continue to increase their competency in performing specific physician tasks. Further, the consistency of responses provided by trainees who continued internally as residents in UBC’s residency programs versus those provided by external students who went on to other programs suggests that it is unnecessary to survey external residents. Surveying only internal residents is a major advantage given that access to these graduates is facilitated by the use of the home institution’s e-mail addresses. Considerably more time and effort is required to obtain contact information for the graduates who pursue training at other institutions. Internal residents do not appear to overestimate the effectiveness of the program, and if anything there was a slight tendency for students who moved elsewhere for residency to rate themselves higher than those who remained within UBC programs. Whether that reflects a real difference (e.g., because of higher-achieving candidates being accepted to very selective programs) or constitutes some form of biased perception is unknown. Regardless, the very high correlation between internal and external residents and the small number of differences in the means indicate that it is reasonable to reduce the cost and increase efficiency and response rate by surveying only internal residents (the only caveat being if a large proportion of graduates enroll in external programs and the internal sample is too small).
Although self-assessment scores are not reliable at the individual level and cannot be trusted to provide an absolute indication of a group’s ability,1–3 aggregated self-assessments from medical students and new residents do appear to hold promise as an indicator of a cohort of students’ relative strengths and weaknesses for the purpose of evaluating overall program effectiveness. Whereas highly rated competencies in the RfC survey by medical students resided within the CanMEDS domains of professional, communicator, and collaborator, five medical expert competencies emerged as particular strengths when the program as a whole was evaluated by residents. The rank ordering revealed that areas where residents would benefit from additional educational support were the more complex and difficult competencies related to diagnosis, management, and communicating bad or difficult news to a patient. Other competency-specific program evaluation tools that might confirm the validity of this rank ordering are not available, and the lack of data from other institutions makes it impossible to know whether the ratings reflect the specific MD program at UBC or represent a broader impression of the relative difficulty in achieving any particular competency. We are currently conducting a study involving three other medical schools to help address these and other limitations.
One such limitation is that the response rates are lower than desirable. The high degree of reliability observed reduces the need for a large sample, but imperfect response rates always have the inherent threat of a nonresponse bias. In addition, we limited our study population to residents from only six core specialties. This decision was made for the sake of methodological expediency as we chose to survey the largest postgraduate programs, but it is possible that residents who match to smaller programs would rate their general competencies differently, thereby increasing the number of respondents required to achieve high degrees of reliability. Further, we collected limited demographic data regarding the campuses at which the students completed their undergraduate MD programs and the locations of the residency programs in which they were currently enrolled. As such, we are presently unable to examine differences across subsets of respondents (e.g., across gender) or to offer an explanation for why the graduates who entered residency training outside of UBC rated themselves higher on some of the physician tasks than did graduates who remained in our institution for their residency training.
Despite these limitations, we are confident in the construct/content validity of our data in part because the strengths of this study include the thorough use of relevant source documents, inclusion of expert review, and an empirical and replicated study of the internal structure of our dataset.5 In fact, while the survey items were developed based on UBC’s expectations, they align well with at least 12 of the 13 task-based performance milestones recently published as Entrustable Professional Activities by the Association of American Medical Colleges.12 This provides further confidence that the content of the survey is applicable across medical schools. It is still conceivable that other programs would consider a different set of competencies to be more appropriate for their institution/context, but even if that were the case, this line of research offers value in the form of a strategy for other institutions to adopt in addressing evaluation questions related to the overall effectiveness of their program. While the specific items schools choose might vary, the use of an aggregated self-assessment model, such as the one studied here, appears to offer a feasible and robustly reliable mechanism to identify the relative strengths and weaknesses of a program regardless of the particular level of training being studied. Further research is needed to determine the generalizability of our findings in other health professions (e.g., nursing, pharmacy, physiotherapy).
The RfR survey can provide reliable data from a very small number of trainees and, when used in combination with the RfC survey, reveals the relative strengths and weaknesses a cohort of students has achieved on competencies expected midway through and at the time of graduation from an undergraduate medical curriculum. These surveys will be useful both for tracking cohort outcomes and for evaluating the effectiveness of programs longitudinally. We are currently testing these surveys to determine their utility as benchmarks to monitor curricular change over time.