Implicit in the design of a curriculum is the understanding that what students know and are able to do at the end of the program exceeds the sum of discrete learning achieved in each curricular component (i.e., the whole is greater than the sum of its parts). This is especially true in the health professions, as the complex curricular structures incorporate many competencies without a 1:1 relationship between particular knowledge/skill sets and curricular experiences. The formal curriculum is only part of students’ experience, and teaching contexts are highly variable. Program evaluation is often limited to parts of the curriculum (i.e., courses, clerkships), based on student-derived data (e.g., satisfaction, performance), and rarely blueprinted against the overarching outcome competencies around which curricula are now commonly designed.1,2 Much like an examination should be based on educational objectives, curriculum evaluation should be based on the competencies that students are expected to achieve.
Although the most feasible way to evaluate programs is to ask participating students for feedback, there are ample reasons to be skeptical of students’ responses. Self-assessment has been shown to be an invalid predictor of individual performance,3 and it is difficult to trust that individuals who have not yet practiced extensively in their field can adequately judge the success with which an educational program has prepared them for such practice. That said, as Surowiecki suggests in The Wisdom of Crowds,4 aggregating many flawed opinions can yield highly accurate information. Research by D’Eon and colleagues5 suggested similarly that aggregating flawed individual self-assessments could potentially provide important information for a group of individuals. Lam6 raised concerns about the data presented, leading D’Eon and Eva7 to determine that the data available to D’Eon et al were insufficient to properly test the hypothesis. They argued that examining the validity of aggregated self-assessments requires testing both the sensitivity (change in scores when change has actually occurred) and specificity (no change when performance remains relatively constant) of these instruments. Examining scores across multiple items would allow for a more continuous (and, hence, statistically powerful) examination of aggregated self-assessments than could be accomplished in the prior work by D’Eon and colleagues.
The purpose of the present study was to determine whether aggregated student judgment offers valid information for evaluating the relative strengths and weaknesses of an educational program. The results will be of interest to applied researchers and evaluators examining the effectiveness of programs.
This study evaluated the first two (preclerkship) years of the four-year undergraduate MD program at the University of British Columbia (UBC) Faculty of Medicine. The UBC behavioral research ethics board reviewed and approved the study.
During the period November 2009 to early February 2010, we developed a Readiness for Clerkship Survey de novo to evaluate the function of the entire preclerkship curriculum of UBC’s medical education program. To our knowledge, no other such instruments are available in the literature. To ensure that the items in the survey represented the constructs we intended to evaluate, we developed it using our knowledge of the structure, content, and aims of the preclerkship curriculum. We began to develop the survey by reviewing four existing documents that assess competence because we thought that these documents adequately represented the educational goals of the preclerkship phase of the curriculum: the (1) UBC MD exit competencies,8 (2) UBC end-of-rotation preceptor assessment of student performance form, (3) University of Toronto Preparation for Clerkship Survey Items (29 task statements used in an informal study for program improvement purposes), and (4) Readiness for Residency Survey being piloted at UBC. Two of us (L.P. and C.L.) participated in the development of the 28-item Readiness for Residency Survey using the UBC MD exit competencies8 and relevant sources in peer-reviewed and gray literature.
First, we generated a list of action statements that articulated competencies in cognitive skills, clinical skills, and professional behaviors expected of students at the end of their preclerkship training. We then mapped the statements back to the four source documents to identify gaps and redundant items and modified, deleted, or added items as needed. We solicited and incorporated feedback about the instrument from three faculty, one resident, and a program evaluator involved in the project as a consultant.
The final instrument consisted of 39 activities that we listed in a sequence reflecting a normal patient encounter (Supplemental Digital Appendix 1, http://links.lww.com/ACADMED/A104). We created two versions: (1) a student self-assessment version in which the current cohort of third-year students rated their ability to perform the activities listed during their early clinical rotations, and (2) a faculty version in which faculty members rated the level of competence, as a group, of the new third-year UBC medical students they supervised during the early months of the students’ clerkships. In both versions of the survey, participants rated items using the following scale: 1 = unacceptable level of competence, 2 = marginal level of competence, 3 = satisfactory level of competence, 4 = high level of competence, 5 = extremely high level of competence, and NA = unable to rate/not applicable. The instrument provided no further guidance.
Participants and procedures
All third-year students in the classes of 2011 (n = 263) and 2012 (n = 267) were eligible to participate. We obtained a list of faculty members teaching third-year students from each department involved in the clerkship rotations for each student cohort. The number of eligible faculty identified for the 2011 class was 1,477. Based on feedback from faculty who returned incomplete surveys, it was evident that not all individuals on the send-out list were actively teaching third-year students. For the next cohort, we asked senior departmental leadership to provide only the names of faculty who regularly supervised third-year students. The number of eligible faculty for the class of 2012 was 1,074.
We administered surveys electronically via one45 (one45 Software, Inc., Vancouver, BC, Canada), the Web-based evaluation software used at UBC, in February 2010 for the class of 2011 (about six months after clerkship began) and in November 2010 for the class of 2012 (about four months after clerkship began). Because the majority of family medicine preceptors were not currently using one45, we sent them the survey by fax and collected their responses the same way.
We conducted all analyses for the classes of 2011 and 2012 separately to examine the replicability of the results. We used descriptive statistics, Pearson correlation coefficients, and ANOVA to compare the ratings assigned to each item by student and faculty participants. We extracted the amount of variance attributable to raters, items, and the rater × item interaction using generalizability analyses, and we used the variances to determine the extent to which different items could reliably be differentiated from one another. This analysis (combined with the correlation between aggregated student and faculty ratings) provides an indication of the sensitivity/specificity of the scores to variability in achievement across different competencies. We used decision studies to determine how many raters of each type (students or faculty) are required to reliably differentiate aspects of a program’s effectiveness.
We conducted exploratory and confirmatory factor analyses (EFA and CFA, respectively) using Mplus version 6 (Statmodel, Los Angeles, California) to identify the factor structure of the new instrument using the 2011 data and to confirm this factor structure using 2012 data. The final factor structure was based on parallel analysis, statistical goodness of fit of the model, scree plot, and magnitude of factor loadings. We evaluated statistical goodness of fit using the following four indices: (1) comparative fit index (CFI), (2) Tucker–Lewis index (TLI), (3) root mean square error of approximation (RMSEA), and (4) standardized root mean square residual (SRMR). We used values commonly accepted as evidence of good fit in relation to each index.9–11 If we found that items cross-loaded on more than one factor or failed to load on at least one factor, we flagged them for potential removal or modification. We considered factor loadings of 0.40 or greater meaningful12; however, we flagged items that had a secondary factor loading between 0.30 and 0.40 for potential cross-loading.
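The loading thresholds described above can be illustrated with a short sketch. This is not the Mplus procedure used in the study; the function name `classify_item`, the category labels, and the choice to treat a secondary loading of 0.40 or more as a definite cross-loading are assumptions made for illustration only.

```python
def classify_item(loadings, primary_cut=0.40, secondary_cut=0.30):
    """Classify one item from its factor loadings (hypothetical helper).

    loadings: one loading per factor for a single item.
    Returns "no_load", "cross_load", "possible_cross_load", or "clean".
    """
    # Rank absolute loadings from largest to smallest.
    ranked = sorted((abs(x) for x in loadings), reverse=True)
    primary = ranked[0]
    secondary = ranked[1] if len(ranked) > 1 else 0.0

    if primary < primary_cut:
        return "no_load"              # fails to load meaningfully on any factor
    if secondary >= primary_cut:
        return "cross_load"           # assumption: two meaningful loadings
    if secondary >= secondary_cut:
        return "possible_cross_load"  # secondary loading between 0.30 and 0.40
    return "clean"


# Made-up loadings on two factors, for illustration only.
print(classify_item([0.72, 0.10]))
print(classify_item([0.55, 0.35]))
```

Items returned as anything other than "clean" would be the ones set aside for possible removal or rewording.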
For the class of 2011, 179/263 students (68%) and 384/1,477 faculty (26%) responded to the survey. For the class of 2012, these proportions were 171/267 (64%) and 419/1,074 (39%), respectively.
Are students able to reliably differentiate between aspects of competence they, as a group, possess?
We used ANOVA and generalizability theory techniques to extract the variance components attributable to rater, item, and the rater × item interaction (residual error). Table 1 reveals that there was considerable noise in the data, with roughly 50% of the variance being attributable to residual error. In general, the variance attributable to differences across competencies (item) was the smallest of the three variance components.
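As a rough illustration of how such variance components can be extracted, the sketch below solves the expected mean squares of a two-way ANOVA. It assumes a complete, fully crossed rater × item design with one rating per cell; the function name `variance_components` and the toy ratings are hypothetical, and the software actually used for the study's generalizability analyses is not stated.

```python
def variance_components(ratings):
    """Estimate rater, item, and residual variance components.

    ratings[r][i] is the score rater r gave item i; the table must be
    complete (every rater rated every item). Illustrative sketch only.
    """
    n_r, n_i = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n_r * n_i)
    rater_means = [sum(row) / n_i for row in ratings]
    item_means = [sum(ratings[r][i] for r in range(n_r)) / n_r
                  for i in range(n_i)]

    # Sums of squares for the two main effects and the residual.
    ss_rater = n_i * sum((m - grand) ** 2 for m in rater_means)
    ss_item = n_r * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_resid = ss_total - ss_rater - ss_item

    ms_rater = ss_rater / (n_r - 1)
    ms_item = ss_item / (n_i - 1)
    ms_resid = ss_resid / ((n_r - 1) * (n_i - 1))

    # Solve the expected mean squares; negative estimates are set to 0.
    return {
        "rater": max((ms_rater - ms_resid) / n_i, 0.0),
        "item": max((ms_item - ms_resid) / n_r, 0.0),
        "residual": ms_resid,  # rater x item interaction confounded with error
    }


# Toy data: 3 raters x 3 items with purely additive rater and item effects.
toy = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(variance_components(toy))
```

Dividing each component by the sum of all three gives the share of variance attributable to that source, which is the form in which the components are discussed here.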
Converting these numbers into reliability coefficients with items as the facet of differentiation confirms that one individual’s opinions are not a reliable indication of which competencies are enabled by an educational program, given that the interrater reliability ranged from G = 0.08 to 0.32 (see Table 2). That said, a decision study we performed on the same data (also illustrated in Table 2) indicated that reliable differentiation (G = 0.8) between items can be achieved as long as responses are averaged across k = 9–45 raters (with faculty appearing to be less consistent in their impressions than students and, thereby, requiring a larger number of raters). Averaging across all raters in each sample for whom a complete data set was available (n = 89–137) yielded near-perfect reliability (G(n) = 0.88–0.98). The correlations between the average scores assigned to items in 2011 and those assigned to the same items in 2012 were near perfect (r = 0.99 for faculty; r = 0.98 for students).
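The logic of the decision study can be sketched with the Spearman-Brown projection, which for a single-facet crossed design is equivalent to dividing the error variance by the number of raters. The exact calculation behind Table 2 is not given in the text, so this is an assumption; the function names are hypothetical, and the inputs are the rounded single-rater coefficients reported above.

```python
import math


def projected_g(g_single, k):
    """Reliability of the mean of k raters, given single-rater reliability."""
    return k * g_single / (1 + (k - 1) * g_single)


def raters_needed(g_single, target=0.80):
    """Smallest whole number of raters whose mean reaches the target."""
    k = target * (1 - g_single) / (g_single * (1 - target))
    # Round first so floating-point noise does not push an exact value up by one.
    return math.ceil(round(k, 9))


# Rounded single-rater coefficients from the text (G = 0.08 to 0.32).
for g in (0.32, 0.08):
    print(g, raters_needed(g))
```

Under these assumptions, G = 0.32 implies 9 raters, matching the lower bound reported above, and G = 0.08 implies 46, close to the reported upper bound of 45; the small gap is consistent with the published coefficient having been rounded.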
How accurate are students’ judgments of competence when considered in the aggregate?
The correlation between item averages assigned by students and those assigned by faculty was r = 0.88 in 2011 and r = 0.91 in 2012. These results do not indicate that the absolute level of proficiency that students reported themselves to have achieved is accurate. Aggregated student self-assessments revealed the general tendency for students to be optimistic in their perceptions of their ability given that those ratings were statistically higher (mean = 3.35; 95% CI = 3.29–3.40) than faculty ratings of the same students (mean = 3.04; 95% CI = 2.97–3.10; F(1,451) = 53.4, P < .001). In other words, the high correlation between faculty and student ratings simply supports the notion that students can identify their relative strengths and weaknesses when treated as a group. In 38 out of 39 individual item comparisons for the 2011 sample, students’ average ratings were higher than those of faculty. The same was true for 34 out of 39 comparisons in the 2012 sample. The likelihood of this occurring by chance is P < .001 in both instances (see Supplemental Digital Table 1, http://links.lww.com/ACADMED/A104).
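One way to check the chance likelihood of such lopsided item comparisons is an exact sign test, sketched below. The text does not name the test that produced the reported P values, so the sign test, the function name `sign_test_p`, and the assumption of no tied items are all illustrative choices.

```python
from math import comb


def sign_test_p(successes, n):
    """Two-sided exact binomial p-value under a 50/50 null (sign test)."""
    extreme = max(successes, n - successes)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, k) for k in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Counts from the text: items on which students' mean exceeded faculty's.
print(sign_test_p(38, 39))  # 2011 sample
print(sign_test_p(34, 39))  # 2012 sample
```

Both values fall well below .001 under these assumptions, in line with the reported result.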
Do items cluster into meaningful constructs that align with a competency framework?
Means and statistical comparisons for all individual items are illustrated in Supplemental Digital Table (http://links.lww.com/ACADMED/A104). We observed no statistically significant differences in the mean scores assigned to each of the 39 items when we compared 2011 faculty ratings with 2012 faculty ratings, and only 1 of 39 items revealed a difference when we compared student scores across years. For the sake of illustrating the consistency of the results obtained from the two cohorts of students and faculty, the 10 highest-rated and 10 lowest-rated competencies for the class of 2011, and the extent to which other cohorts include the same items, are shown in Table 3. The 10 highest-rated items were the same for both cohorts of students, and 8 of these competencies were also in the top 10 as indicated by the two cohorts of faculty. Nine out of 10 items on which students rated themselves lowest were the same for both cohorts of students, and 7 to 8 of these competencies were also in the bottom 10 as indicated by the two cohorts of faculty.
Overall, EFA and CFA for faculty and student data identified four items that did not meet the criteria for subscale loading and were excluded from further analysis (items 8, 22, 23, and 39; see Supplemental Digital Appendix 1, http://links.lww.com/ACADMED/A104). We combined the student cohorts for these analyses because of their small sample sizes. The EFA revealed a two-factor solution (for faculty and students, respectively, CFI = 0.94, 0.94 [good fit is >0.9], TLI = 0.94, 0.93 [good fit is >0.9], RMSEA = 0.07, 0.10 [good fit is <0.06], SRMR = 0.07, 0.08 [good fit is <0.08]). We labeled these two factors Clinical Skills and Knowledge Application (CSKA) and Working as a Professional (WP) (factor loadings are illustrated in Supplemental Digital Table 2, http://links.lww.com/ACADMED/A104). The CSKA subscale includes actions related to obtaining and interpreting information from and about the patient, and integrating this with prior knowledge leading to the formulation of a diagnosis and management plan (i.e., activities that are directly related to the medical expert role).1 The WP subscale includes items related to being responsible for and interacting with the patient and team, and demonstrating self-care and self-directed learning (i.e., activities that are more broadly linked to the roles of advocate, collaborator, communicator, professional, and scholar).1 Two of the four items that did not meet the criteria for subscale loading related to the manager role (items 22 and 39), one related to the communicator role (item 8), and the last related to the advocate role (item 23).
To provide a more rigorous test of the two-factor model, we conducted a CFA on the 2012 faculty data. Items that loaded on a specific factor in the EFA loaded on the same factor in the CFA. This analysis provided an acceptable fit to the data (CFI = 0.96, TLI = 0.95, RMSEA = 0.08), confirming the two-factor structure (Supplemental Digital Table 2, http://links.lww.com/ACADMED/A104). Table 4 presents the means, standard deviations, and coefficient alphas for each subscale for all four samples. Corrected item–total correlations were above the lower bound of 0.30 considered acceptable for each item.13 For each sample, the WP subscale was rated higher than the CSKA subscale.
The results of this study indicate that aggregated student self-assessments, as measured by the Readiness for Clerkship Survey, provide valid information that can be used to evaluate a two-year undergraduate preclerkship program. We found that (1) aggregated student data reliably differentiate between aspects of competence attained in the first two years of undergraduate training, (2) students’ judgments of competence, when aggregated, correlate very well with faculty data, although students tend to assign higher scores to themselves, and (3) survey items cluster into meaningful constructs that align with a competency framework.
The interrater reliabilities show that one person’s ratings of the various competencies tell us little beyond that person’s own views, because judgments did not generalize well between raters. Averaging across raters, however, yielded a score that is quite a reliable indicator of which competencies are demonstrated better relative to others. Decision studies revealed that one can achieve reliable differentiation between items (G = 0.8) in this domain with fewer students (9–21) than faculty (26–45). This is an important finding because medical students often experience high levels of survey burden in their programs. It suggests that a relatively small sample of students can be randomly selected to complete the survey without compromising reliability.
When we considered the self-assessments of many students in aggregate, we found that the judgments aligned very well with those of faculty raters. The correlation of item averages between faculty and students was very high (r > 0.85 in both years). This finding stands in sharp contrast to the r ≈ 0.3 typically observed when considering performance at the individual level.14 These results, along with the consistency we observed across cohorts, support the validity of the Readiness for Clerkship Survey as a group-level indicator of students’ postprogram skill. The difference in means between students and faculty suggests that, although the instrument appears to provide results that are useful in identifying both relative strengths and weaknesses that could inform program development and improvement, the absolute scores may be less trustworthy.
The findings help to resolve an issue put forward by D’Eon and colleagues.5 They found that aggregated self-assessment scores were higher after an educational workshop relative to pre intervention (as were scores gathered on a more objective test of domain knowledge). By analogy to diagnostic testing, such data suggest that aggregated self-assessments are sensitive measures of improvement, but do not speak to the specificity of those measurements. Increases in self-assessment scores may have resulted simply from participants’ assumption that engaging in an educational activity will improve one’s subject knowledge rather than from participants being actually aware (even at a group level) of knowledge gain. D’Eon and Eva confirmed this concern post hoc by splitting the participants into those for whom the external measure of knowledge indicated improvement and those for whom it did not.7 Both groups’ aggregated self-assessments increased post intervention, thereby discounting their validity for program evaluation. It is difficult, however, to know what to conclude from those findings because the data were not collected for the purpose of these analyses and because the median split is a coarse strategy that resulted in a somewhat small sample size in both subsamples. The current study overcame that problem by including a series of items such that a more continuous association between aggregated self-assessment and external judgment could be examined. Our findings suggest that, although self-assessment scores cannot be trusted at the individual level or to give absolute indication of a group’s ability, aggregated self-assessments from students do appear to hold promise as an indicator of a program’s relative strengths and weaknesses for the purpose of evaluating overall program effectiveness.
There are several limitations to this study that should be considered. First, the faculty response rate is lower than desirable. Although this response rate improved in the 2012 sample and the results were replicated across years, the response rate is likely underestimated as a result of our delivery methods. For example, we know that individuals who were not actively teaching received the survey, and we have no way to confirm what proportion of recipients actually received the fax sent to their attention. The decision studies reported in Table 2 reveal that the faculty samples were substantially larger than they needed to be for the sake of achieving reliable measurement, but low response rates always carry the threat that respondents may have been nonrepresentative of the population of interest. Second, using data from only one school, it is impossible to tell whether the ratings reflect the specific situation at UBC or represent a broader impression of the relative difficulty of achieving any particular competence. Nonetheless, the study offers a strategy for other institutions to adopt in addressing evaluation questions related to the overall outcomes of their program. We would emphasize that the specific items schools choose to use in an aggregated self-assessment model, such as the one studied here, should be tailored to the specific competencies that the institution strives to help students achieve at the particular level of training that is being studied.
Despite these limitations, we are confident in the construct/content validity of our data in part because of the thorough use of relevant source documents, the expert review, and the empirical study of the internal structure of the data set, conducted via factor analysis. The two-factor structure we identified aligns well with many current models of impression formation, which suggest that we have a tendency to judge people along two dimensions (academic and interpersonal; or, in this case, clinical skills/knowledge and the nonmedical expert roles important to clinical practice).15 It is possible that the four items that did not load on either scale represent different constructs or were not appropriately worded and require revision.
In summary, our results indicate that the newly developed Readiness for Clerkship Survey can be used to collect valid and reliable program-level data regarding the relative functioning of competencies embedded in a preclerkship curriculum. The natural extension of this work, in which we are currently engaged, is to evaluate how well the four-year program prepares students for residency.
Acknowledgments: The authors wish to thank Dr. Martin Schreiber, University of Toronto Faculty of Medicine, for providing the University of Toronto Preparation for Clerkship Survey Items.
Funding/Support: No internal or external funding. This study was conducted by the Evaluation Studies Unit, Faculty of Medicine at the University of British Columbia as part of program evaluation.
Other disclosures: None.
Ethical approval: This study was approved by the behavioral research ethics board at the University of British Columbia.