Program evaluation typically addresses several domains,1 including the curriculum itself, teachers, student experience, and outcomes.2 This last domain has also been referred to as efficacy outcomes.3 In the present study, we focused on student performance gain as one indicator of course quality. To quantify this gain, a valid and reliable assessment of student performance before and after exposure to a specific course was desirable. Well-designed objective examinations produce the most comprehensive data but are difficult to implement in a pre–post setting.
With that difficulty in mind, we recently developed a novel pre–post evaluation tool4 that uses repeated student self-assessments regarding specific learning outcomes to calculate gains in knowledge, skills, and attitudes. Although preliminary findings suggest that this tool provides reliable data, the fact that it uses student self-assessments might raise concerns regarding its validity. Student self-assessments have been studied in great detail.5–7 Self-assessment ability varies across individuals8 and can be influenced by contextual9 as well as social and individual features.5,10,11 In general, self-assessment is confounded by a number of factors and has thus been characterized as “a complicated, multifaceted, multipurpose phenomenon that involves a number of interacting cognitive processes.”9 It is mainly used to help learners identify their own learning needs, although the available literature yields conflicting results regarding its reliability and validity.12,13 The data suggest that the tools used to elicit self-assessments need to be as specific as possible to obtain accurate ratings.14
In a critical appraisal of research involving self-assessments, Lam15 demonstrated how the unit of analysis affects evaluation results: Calculations can be performed either at the individual level (i.e., using data obtained from individual students as singular data points) or at the group level (i.e., using aggregated data by computing group means). Findings from evaluation studies cannot be generalized across these levels because performance differences tend to be greater as the unit of analysis increases.16 Consequently, the level at which analyses are performed needs to be taken into account in studies using self-assessments to assess learning outcomes.
It has recently been suggested that both the sensitivity and specificity of evaluation tools must be assessed; the results of such assessment clearly depend on the level of analysis.17 Thus, our aims in the present study were to
- establish the criterion validity of performance gain derived from comparative student self-assessments on the group level (using an objective external criterion),
- establish the criterion validity of performance gain derived from comparative student self-assessments on the individual level,15 and
- define a quality cutoff for subjective performance gains that can be used to differentiate specific learning objectives with favorable performance gain from those with suboptimal gain.
We hypothesized that there would be good agreement between subjective and objective performance gain on the group level and that the association between the two measures would be weaker on the individual level. In addition, we hypothesized that a quality cutoff for subjective performance gain could be used to predict objective performance gain with favorable sensitivity and specificity.
To assist readers, we created Table 1 to define some of the terms used in this study.
Setting and Participants
In Germany, students enter medical education at an average age of 20 years, following graduation from secondary school (see Chenot18 for more about German medical school education).
We conducted our study at Göttingen Medical School. Like most German medical schools, it offers a six-year curriculum comprising two preclinical and three clinical years followed by a practice year (three 16-week elective periods in different specialties). The number of students enrolled in each half-year of the three-year clinical curriculum is 145 (thus, a total of 870 students are typically enrolled in the six half-years).
The clinical curriculum has a modular structure; the sequence of modules is identical for all students. For this study, we chose as potential participants the 145 students who, in the spring of 2011, were taking part in the six-week, interdisciplinary cardiorespiratory module at the beginning of year 4 (clinical year 2). The course comprises lectures, seminars, skills training using simulators, clinical teaching sessions, and case-based learning in small groups. The main specialties contributing to the cardiorespiratory module are listed in Table 2.
For the purpose of this study, we collected both subjective and objective outcome measures from students at the beginning and at the end of the module. To ensure comparability of subjective and objective performance gains, we designed all data collection tools to provide ratings on a six-point scale.
Our institution’s Catalogue of Specific Learning Objectives lists 46 specific learning objectives for the cardiorespiratory module, 33 of which are mainly related to factual knowledge. We transformed the latter to statements closely reflecting each objective (e.g., “I know the definition of chronic bronchitis”; see Chart 1) and asked students to self-rate their performance levels on a six-point scale anchored at 1 (completely agree) and 6 (completely disagree). We scheduled 10 minutes for the completion of the self-assessment questionnaires.
On completion of the self-assessment questionnaires, we asked students to take a formative examination (50 minutes) consisting of 33 questions, each of which was closely matched to the learning objectives used for student self-assessments. Each question consisted of five true/false items, allowing for a total score of 0 to 5 for a specific question, according to which boxes were ticked (see a sample question presented in Chart 1). As a consequence, both objective and subjective student performance measures were coded on a six-point scale.
All data collection tools were developed by one of us (T.R.) and pilot-tested in a small group (n = 5) of students who had completed the cardiorespiratory module in winter term 2010–2011. Only minor adjustments were needed; these mainly involved revising questions in the formative exam that pilot students perceived as too easy to answer.
Students received an e-mail outlining the study rationale four weeks before the module. We informed students that study participation was voluntary. As an incentive, we raffled 30 book vouchers each worth €25 (approximately $32) among all participants. We obtained basic demographic data (age and sex), self-assessments, and objective performance data during plenary sessions on the first day of the module (pretest, April 11, 2011) as well as three days before the end of the module (posttest, May 17, 2011). We used identical paper questionnaires at both time points.
Calculation of performance gain
We coded self-assessments from 1 (best possible option) to 6 (worst possible option), and transformed (i.e., inverted) formative examination results to similar scores. We used these data to calculate subjective and objective performance gains.
- Group-level analysis: We used mean values across the student cohort to calculate aggregated performance gain (APG) according to the following formula:

APG = (x̄_pre − x̄_post) / (x̄_pre − 1) × 100%

In this equation, x̄_pre denotes the mean initial self-assessment (sAPG) or examination score for a specific objective (oAPG), and x̄_post denotes the mean self-assessment (sAPG) or examination score for the same objective after the course (oAPG).
- Individual-level analysis: We used matched pre–post data of individual students to calculate individual performance gain (IPG) according to the following formula:

IPG = (x_pre − x_post) / (x_pre − 1) × 100%

where x_pre and x_post denote an individual student’s scores for a specific objective before and after the course, respectively.
According to these formulas, the difference between (mean) scores is divided by the initial score corrected by −1 (owing to the scales’ being anchored at 1 as opposed to 0), thus accounting for initial performance levels. For a more detailed description of performance gain calculations, please refer to Raupach et al.4
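As an illustration, the inversion of examination scores and the gain calculation described above can be sketched in a few lines (a minimal sketch with hypothetical scores; the function and variable names are ours, not part of the original tool):

```python
def invert(score):
    """Invert a 1-6 examination score so that 1 is the best and 6 the
    worst possible value, matching the coding of the self-assessments."""
    return 7 - score

def performance_gain(pre, post):
    """Relative gain in percent: the pre-post difference divided by the
    initial score corrected by -1 (scales are anchored at 1, not 0)."""
    return (pre - post) / (pre - 1) * 100

# Group level (APG): cohort mean scores for three hypothetical
# objectives, coded from 1 (best) to 6 (worst).
pre_means = [4.8, 5.2, 3.9]
post_means = [2.1, 2.6, 2.0]
apg = [performance_gain(a, b) for a, b in zip(pre_means, post_means)]

# Individual level (IPG): one student's matched pre-post scores.
ipg = performance_gain(5, 3)  # improving from 5 to 3 yields a 50% gain
```

Note that, with this formula, improving from 5 to 3 yields the same 50% gain as improving from 3 to 2, because gains are expressed relative to the remaining room for improvement on the scale.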
- To establish criterion validity of performance gain calculated from self-assessments on the group level, we displayed sAPG and oAPG values for all 33 learning objectives in a Bland–Altman plot and calculated the correlation between the two measures as Pearson r.
- To assess the association between subjective and objective performance gains calculated from self-assessments on the individual level, we calculated the correlation between sIPG and oIPG values for all 33 learning objectives for each individual student, thus obtaining a correlation coefficient for each study participant.
- To define a quality cutoff for sAPG values, we performed a receiver operating characteristic (ROC) curve analysis, comparing subjective with objective performance gains. This type of analysis is commonly used to establish the sensitivity and specificity of diagnostic tests in clinical medicine.19,20 As a first step, we defined favorable performance gain on the basis of oAPG values: According to the formula for APG presented earlier, an APG of 50% corresponds to an increase from two (out of six) correct answers in the formative pretest to four correct answers in the posttest. The same APG value would be noted for an increase from four to five correct answers. Thus, we considered an oAPG of ≥50% for a specific objective to be a favorable performance gain. In the ROC analysis, we tested different sAPG quality cutoffs (0%–100%) with regard to their ability to discriminate between learning objectives with favorable and suboptimal performance gains in the formative examinations as judged by their oAPG values. The optimal sAPG quality cutoff was defined as the point on the ROC curve that combined maximal sensitivity with the best possible specificity.
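The cutoff search can be sketched as follows (hypothetical sAPG and oAPG values for six objectives; `roc_cutoff` is our illustrative helper, not the original analysis code):

```python
def roc_cutoff(sapg, oapg, favorable=50.0):
    """Scan candidate sAPG cutoffs (0%-100%) and return the one that
    combines maximal sensitivity with the best possible specificity.
    Objectives with oAPG >= `favorable` count as favorable gains."""
    truth = [o >= favorable for o in oapg]
    best = None
    for cut in range(101):
        pred = [s >= cut for s in sapg]
        tp = sum(p and t for p, t in zip(pred, truth))
        fn = sum(t and not p for p, t in zip(pred, truth))
        tn = sum(not t and not p for p, t in zip(pred, truth))
        fp = sum(p and not t for p, t in zip(pred, truth))
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        # Prefer higher sensitivity; break ties by specificity.
        if best is None or (sens, spec) > (best[1], best[2]):
            best = (cut, sens, spec)
    return best  # (cutoff in %, sensitivity, specificity)

# Hypothetical gains for six learning objectives:
sapg = [20.0, 35.0, 60.0, 70.0, 80.0, 95.0]
oapg = [10.0, 55.0, 40.0, 65.0, 70.0, 90.0]
cutoff, sensitivity, specificity = roc_cutoff(sapg, oapg)
```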
We used the statistical computing environment of R (see www.r-project.org) for the analyses and set significance levels to 5%. We present descriptive data as means and standard deviations (SDs), and we present correlations as Pearson r (95% confidence interval).
The institutional review board of Göttingen Medical School waived ethical approval because the study protocol was not deemed to represent biomedical or epidemiological research. We made every effort to comply with data protection rules. Study participation was voluntary, and all participants signed an informed consent form before entering the study. Analysis of individually matched student data was facilitated by asking students to create individual codes. No personal identification was required.
Response rate, participants’ characteristics, and examination results
Of the 145 module students, 83 (57.2%) completed the pre- and posttests. There were 33 men and 50 women (mean age 24.8, SD = 2.3 years). Each of the tests consisted of a self-assessment questionnaire and a formative examination. Cronbach alpha of the formative pre- and postexaminations was 0.89 and 0.87, respectively. In these examinations, students achieved percent scores of 17.6 (SD = 7.5) and 56.5 (SD = 8.9), respectively. Calculated oAPG values for specific learning objectives ranged from 5.3% (tuberculosis) to 90.7% (tetralogy of Fallot), with a mean of 47.7, SD = 23.5%. Overall, sAPG values derived from student self-assessments were slightly higher (mean 58.9, SD = 22.6%) and ranged from 12.0% (interstitial lung disease) to 95.3% (tetralogy of Fallot).
Criterion validity of sAPG values (group-level analysis)
The agreement between subjective and objective APG data for the 33 learning objectives is shown in a Bland–Altman plot (Figure 1). The association between sAPG and oAPG values as assessed by the correlation between the two measures was 0.78 (0.6–0.89), P < .0001. These findings support our hypothesis that performance gain calculated from aggregated self-assessments adequately reflects objectively measured performance gain on the level of specific learning objectives.
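A confidence interval of this form is what the standard Fisher z-transformation yields (the default for Pearson correlations in, e.g., R's cor.test); whether the original analysis used exactly this method is our assumption. A minimal sketch:

```python
import math

def pearson_r_ci(x, y, z_crit=1.959964):
    """Pearson r with an approximate 95% CI via the Fisher
    z-transformation (standard error = 1/sqrt(n - 3))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    z = math.atanh(r)            # Fisher transformation
    se = 1 / math.sqrt(n - 3)
    lo, hi = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)
    return r, (lo, hi)

# Hypothetical paired gain values for 10 learning objectives:
r, ci = pearson_r_ci(list(range(10)), [v + (-1) ** v for v in range(10)])
```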
Criterion validity of sIPG values (individual-level analysis)
For each study participant, we used matched pre–post data derived from self-assessments and examination results to calculate individual correlation coefficients between subjective and objective IPG values for the 33 learning objectives. The dispersion of coefficients is illustrated in Figure 2. Individual correlation coefficients ranged from −0.09 to 0.69 with a mean r of 0.37. All but one student’s data yielded positive correlations. These findings support our hypothesis that the association between subjective and objective performance gain is stronger on the group level than the individual level.
Identification of a quality cutoff for sAPG values
Results of the ROC analysis are displayed in Figure 3. We estimated the quality cutoff for sAPG values to be at 54.7%, indicating that suboptimal performance gains had been achieved for learning objectives with lower values. When using this cutoff, sensitivity of the approach was 100% (i.e., all learning objectives with an APG of ≥50% in the formative examinations were correctly identified by subjective APGs), and specificity was 57% (i.e., 57% of learning objectives with suboptimal performance gain in the formative examination were correctly identified by subjective data). More important, the positive predictive value of sAPG values calculated from student self-assessments was 59% (i.e., three out of five learning objectives with an sAPG above the quality cutoff in fact yielded favorable oAPG values), and the negative predictive value was 100%, thus allowing the precise identification of learning objectives for which a suboptimal oAPG had been established.
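These figures follow directly from the underlying 2×2 table. As a sketch with hypothetical cell counts (chosen only to mimic the reported pattern of no false negatives; they are not the study's actual counts):

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, and predictive values from a 2x2 table
    (tp/fp/fn/tn = true/false positives and negatives)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

# Hypothetical counts for 33 learning objectives: with no false
# negatives, both sensitivity and NPV equal 100%.
metrics = diagnostic_metrics(tp=13, fp=9, fn=0, tn=11)
```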
The results of our study provide evidence of satisfactory criterion validity of an outcome-based evaluation tool that uses student self-assessments to estimate objective learning outcomes. Supporting our hypotheses, group-level analysis yielded a stronger association between subjective and objective performance gain than did individual-level analysis, and the quality cutoff derived from ROC analysis allowed reliable identification of learning objectives with suboptimal performance gain.
In addition to providing a critical appraisal of organizational and structural aspects of teaching, evaluation of undergraduate medical courses should address learning outcome as one important quality indicator.21,22 Results of summative examinations can be taken to reflect actual learning outcomes and have been included in models assessing teaching quality; in Germany, they have even been used to allocate resources across medical schools.23 However, the utility of exam results is limited. First, a recent study24 indicated that summative examinations within medical schools frequently do not meet quality standards, calling the validity of these exams’ results into question. Second, calculating objective performance gain during a particular teaching module would require repeat examinations, but curriculum-wide implementation of pre- and posttesting in every course is unrealistic. As a consequence, we aimed to create an easy-to-use, low-cost evaluation tool that provides a valid surrogate parameter of objective learning outcomes.
Given the impact of the unit of analysis on evaluation results,15 we calculated APG from means across the student cohort (i.e., group-level analysis) as well as from matched data points provided by individual students (individual-level analysis). As expected from the literature,15,16 we found that correlations between subjective and objective performance gain were much weaker when calculated on the individual level than when using cohort means as the primary data points. Accordingly, group-level analysis appeared to be the superior approach in the context of this study.
In summary, the evaluation tool described here generates data that might feed into a measurement of educational outcome as one variable to be used to inform program evaluation.3
In contrast to an earlier study that yielded similar results but was criticized for methodological weaknesses,25 we included not just one but 33 different learning objectives, calculated pre–post differences,26 controlled for initial performance levels, and used a self-assessment tool that was closely matched to the format of the external criterion. The tool does not require individual labeling of students, as performance gain regarding specific objectives is calculated from aggregated data (i.e., on the group level). The good agreement between performance gain calculated from repeated self-assessments and an external criterion (formative examinations) suggests that the evaluation tool provides valid estimates of learning outcome regarding specific learning objectives. This kind of information is hard to extract from global course ratings,27 yet it is highly relevant for teaching coordinators trying to identify specific educational shortcomings. Another strength of our outcome-based evaluation tool is its ability to present students with specific and precise statements such as “I know the anatomical features of Fallot’s tetralogy.” Research indicates that the accuracy of self-assessments correlates with task specificity (i.e., precisely defined learning outcomes yield more accurate self-assessments).11 Evidently, the validity of the tool depends crucially on the quality of the statements used for student self-assessments.
A change in self-assessments following exposure to a teaching intervention does not necessarily reflect actual learning. For example, course participation may increase a student’s understanding of the complexity of the content taught, thereby also altering participants’ internal standards for self-assessment. This impact of teaching on individual benchmarking (also known as response shift bias)28 is likely to have been small in this study, as indicated by the good agreement between subjective and objective performance gain. However, future research needs to determine whether using retrospective student ratings of initial performance levels would produce performance gains similar to the ones observed in our pre–post design.
One possible explanation for the low response rate observed in this study is that completing questionnaires and examinations was time-consuming, and the incentive we used (book vouchers) might not have served as a sufficient motivator to participate in the study. As a consequence, our sample is likely to contain a higher proportion of motivated students who were willing to spend extra time on study-related tasks.
Analysis of formative examination results revealed that scores achieved in the posttest were moderate. This may be explained by the fact that students were not used to this type of examination; in addition, formative examinations generally generate lower motivation to achieve high scores than do summative examinations.29 As a consequence, the sAPG quality cutoff identified in the ROC analysis might not necessarily be transferable to different examinations, courses, curricula, and medical schools. In different settings, adequate quality cutoff values need to be determined using objective pre- and posttests before inferring conclusions on learning outcomes from student self-assessments. Further research also needs to assess whether the evaluation tool can be used to estimate actual performance gain regarding practical skills and affective learning outcomes. The way the evaluation tool was designed should facilitate its use in all situations where learning objectives can be clearly defined, including but not limited to different medical curricula. It would be interesting to see if the tool would still produce valid results in other areas of higher education not related to medicine.
In this study we compared increases in knowledge measured in an objective examination with increases derived from student self-assessments and found a strong and highly significant correlation between the two measures in a group-level analysis but not in an individual-level analysis. Results of the ROC analysis indicate that subjective data can be used to discriminate learning objectives with favorable performance gain from those with suboptimal outcome. The evaluation tool is easy to implement, takes initial performance levels into account, and does not require extensive pre–post testing. It may thus assist medical teachers in identifying specific strengths and weaknesses of a particular course, thus paving the way for further improvements in teaching.
Funding/Support: This study was funded by a research program at the Faculty of Medicine, Georg-August-University Göttingen, Göttingen, Germany.
Other disclosures: None.
Ethical approval: The institutional review board of Göttingen Medical School waived ethics approval because the study protocol was not deemed to represent biomedical or epidemiological research.
1. McOwen KS, Bellini LM, Morrison G, Shea JA. The development and implementation of a health-system-wide evaluation system for education activities: Build it and they will come. Acad Med. 2009;84:1352–1359
2. Gibson KA, Boyle P, Black DA, Cunningham M, Grimm MC, McNeil HP. Enhancing evaluation in an undergraduate medical education program. Acad Med. 2008;83:787–793
3. Blumberg P. Multidimensional outcome considerations in assessing the efficacy of medical educational programs. Teach Learn Med. 2003;15:210–214
4. Raupach T, Münscher C, Beissbarth T, Burckhardt G, Pukrop T. Towards outcome-based programme evaluation: Using student comparative self-assessments to determine teaching effectiveness. Med Teach. 2011;33:e446–e453
5. Colthart I, Bagnall G, Evans A, et al. The effectiveness of self-assessment on the identification of learner needs, learner activity, and impact on clinical practice: BEME guide no. 10. Med Teach. 2008;30:124–145
6. Falchikov N, Boud D. Student self-assessment in higher education: A meta-analysis. Rev Educ Res. 1989;59:395–430
7. Davis DA, Mazmanian PE, Fordis M, Van Harrison R, Thorpe KE, Perrier L. Accuracy of physician self-assessment compared with observed measures of competence: A systematic review. JAMA. 2006;296:1094–1102
8. Regehr G, Hodges B, Tiberius R, Lofchy J. Measuring self-assessment skills: An innovative relative ranking model. Acad Med. 1996;71(10 suppl):S52–S54
9. Eva KW, Regehr G. Self-assessment in the health professions: A reformulation and research agenda. Acad Med. 2005;80(10 suppl):S46–S54
10. Fitzgerald JT, White CB, Gruppen LD. A longitudinal study of self-assessment accuracy. Med Educ. 2003;37:645–649
11. Gordon MJ. A review of the validity and accuracy of self-assessments in health professions training. Acad Med. 1991;66:762–769
12. Ward M, Gruppen L, Regehr G. Measuring self-assessment: Current state of the art. Adv Health Sci Educ Theory Pract. 2002;7:63–80
13. Rozenblit L, Keil F. The misunderstood limits of folk science: An illusion of explanatory depth. Cogn Sci. 2002;26:521–562
14. Mandel LS, Goff BA, Lentz GM. Self-assessment of resident surgical skills: Is it feasible? Am J Obstet Gynecol. 2005;193:1817–1822
15. Lam TCM. Do self-assessments work to detect workshop success? Am J Eval. 2009;30:93–105
16. Hayman J, Rayder N, Stenner AJ, Madey DL. On aggregation, generalization, and utility in educational evaluation. Educ Eval Policy Anal. 1979;1:31–39
17. D’Eon MF, Eva KW. Self-assessments for workshop evaluations. Am J Eval. 2009;30:259–261
18. Chenot JF. Undergraduate medical education in Germany. Ger Med Sci. April 2, 2009;7:Doc02
19. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36
20. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115:928–935
21. Schiekirka S, Reinhardt D, Heim S, et al. Student perceptions of evaluation in undergraduate medical education: A qualitative study from one medical school. BMC Med Educ. 2012;12:45
22. Kogan JR, Shea JA. Course evaluation in medical education. Teach Teach Educ. 2007;23:251–264
23. Herzig S, Marschall B, Nast-Kolb D, Soboll S, Rump LC, Higgers RD. Distribution of government funds according to teaching performance. A position paper of the associate deans for medical education in North Rhine-Westphalia. GMS Z Med Ausbild. 2007;24:Doc109
24. Möltner A, Duelli R, Resch F, Schultz J-H, Jünger J. School-specific assessment in German medical schools. GMS Z Med Ausbild. 2010;27:Doc44.
25. D’Eon M, Sadownik L, Harrison A, Nation J. Using self-assessments to detect workshop success. Am J Eval. 2008;29:92–98
26. Thompson BM, Rogers JC. Exploring the learning curve in medical education: Using self-assessment as a measure of learning. Acad Med. 2008;83(10 suppl):S86–S88
27. Raupach T, Schiekirka S, Münscher C, et al. Piloting an outcome-based programme evaluation tool in undergraduate medical education. GMS Z Med Ausbild. 2012;29:Doc44
28. Skeff KM, Stratos GA, Bergen MR. Evaluation of a medical faculty development program. Eval Health Prof. 1992;15:350–366
29. Raupach T, Hanneforth N, Anders S, Pukrop T, ten Cate O, Harendza S. Impact of teaching and assessment format on electrocardiogram interpretation skills. Med Educ. 2010;44:731–740