McOwen, Katherine S.; Kogan, Jennifer R.; Shea, Judy A.
Section Editor(s): Peters, Antoinette S. PhD; Frye, Ann W. PhD
Substantial research describes psychometric characteristics of student-based evaluations of their courses, lectures, and small-group sessions such as reproducibility and validity of the ratings1–4 and examines the impact of practical concerns related to form format and student response styles.5,6 These studies focus on paper-and-pencil administration of end-of-course evaluations. However, many institutions are replacing paper-and-pencil evaluations with electronic evaluation systems.7 Among the benefits of Web-based evaluation systems are their ease of use and cost-effectiveness.8 Their use raises new measurement questions such as the impact of evaluation timing.
With electronic evaluations, it is feasible to make evaluations available for each teaching event (e.g., laboratory, lecture) immediately after it occurs, when learners’ memories are fresh.9 In contrast, research on evaluations most commonly focuses on paper evaluations administered at the end of a course, listing all course small groups/labs/lecturers, some occurring a few days in the past and others weeks earlier, requiring the evaluator to think back. According to recency theory, described in the educational psychology literature,10 individuals tend to remember more recent items better. Thus, when events are evaluated proximate to their delivery, they will be more accurately remembered.
Within a typical medical school week, students are likely to experience both higher- and lower-quality teaching. Relying on the recency theory, students completing evaluations in a timely manner are more likely to recognize these differences and assign high ratings to some events and lower ratings to others, giving an “average” mean across events. Over time, students will continue to evaluate each event. However, their memories will fade, and it is likely the ratings will be less extreme (regression to the mean). Consequently, mean evaluation ratings will remain stable and will not be related to timing of the evaluation completion. Alternatively, primacy effect theory suggests that when comparing multiple events, the best and the worst events trigger a memory response that does not occur with other events.10 This would suggest greater rating stability (affecting both means and variances) among the best and worst events compared with average events.11
Our objective was to assess the relationship between when learners complete evaluations (specifically, the number of weeks elapsed between an event occurrence and when it was evaluated) and how events are evaluated, looking first at the overall impact and then stratifying events by quality. Given recency theory, we hypothesized that when comparing evaluations completed within fewer versus more elapsed weeks, the means would be similar but the variances would be smaller for evaluations submitted after more elapsed weeks. Our second hypothesis, based on primacy theory, was that differences in means and variances between evaluations submitted early and those submitted late would be greater for middle-rated/mid-quality events than the highest- or lowest-rated events. That is, if an event was truly excellent or poor, elapsed weeks would have less impact.
All evaluations were submitted in a Web-based evaluation system designed internally at the University of Pennsylvania School of Medicine. Evaluations became available on the day of the event (lecture/lab/small group) and remained open until the first week of the next calendar year. Each event’s “quality of session” was rated on a five-point scale (1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent). Students were required to complete evaluations to receive their final course examination grades.
Evaluations submitted for all events in 15 Fall 2006 preclinical courses were included: nine courses were first-year basic sciences, and six were second-year organ systems. Within these 15 courses, there were 384 lecture and 147 lab/small-group events (531 total events). Each event should have been evaluated by approximately 150 students for a total of approximately 79,650 evaluations. The analytic dataset has 71,472 evaluations (about a 90% response rate). Three hundred four students (152 MSI and 152 MSII) completed a mean of 235 evaluations per person (range, 2–291). This study was approved by the university’s institutional review board.
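The counts in this paragraph can be checked with simple arithmetic. The short sketch below uses only the figures reported above; the variable names are illustrative, and the figure of 150 students per event is the approximation stated in the text.

```python
# Figures reported in the text; variable names are illustrative only.
lecture_events = 384
lab_small_group_events = 147
approx_students_per_event = 150   # approximation stated in the text
submitted_evaluations = 71_472
students = 152 + 152              # MSI + MSII

total_events = lecture_events + lab_small_group_events
expected_evaluations = total_events * approx_students_per_event
response_rate = submitted_evaluations / expected_evaluations
mean_per_student = submitted_evaluations / students

print(f"events: {total_events}")
print(f"expected evaluations: {expected_evaluations}")
print(f"response rate: {response_rate:.1%}")
print(f"mean evaluations per student: {mean_per_student:.0f}")
```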
A categorical variable was created representing the number of elapsed weeks from the day of the event to the day of evaluation submission, where W1 is one week or less, W2 is more than one week but no more than two weeks, W3 is more than two weeks but no more than three weeks, and W4 is more than three weeks. These categories were chosen on the basis of observations of how students use the evaluation system. Generally, grades are returned within three to four weeks, so most students do their evaluations within this time frame. The mean number of days from event to evaluation in W4 was 33, the median was 29, and the range was 21 to 148.
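As an illustration of the categorization described above, the following is a minimal sketch rather than the code used in the study. It assumes that a week is counted as seven calendar days and that each upper boundary is inclusive; the function name and the example dates are hypothetical.

```python
from datetime import date

def elapsed_week_category(event_day: date, submitted_day: date) -> str:
    """Map days between an event and its evaluation to W1..W4.

    Assumes 7-day weeks with inclusive upper bounds (an assumption,
    not a detail reported in the article).
    """
    days = (submitted_day - event_day).days
    if days < 0:
        raise ValueError("evaluation cannot precede the event")
    if days <= 7:
        return "W1"   # one week or less
    if days <= 14:
        return "W2"   # more than one week, no more than two
    if days <= 21:
        return "W3"   # more than two weeks, no more than three
    return "W4"       # more than three weeks

# Hypothetical example: an event on 4 September 2006 evaluated 10 days later.
print(elapsed_week_category(date(2006, 9, 4), date(2006, 9, 14)))
```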
First, to describe the data characteristics, an ANOVA was performed to determine whether there were differences in means or variances related to elapsed week categories. Effect sizes (Cohen’s d adjusted for large N values) were computed. An ordinary least squares regression was done regressing event ratings on timing of evaluation (actual days elapsed), controlling for type of event (lecture versus small group/lab) and student year. Repeated-measures ANOVAs were done in datasets aggregated to the level of the event (N = 531) and within event stratified into quartiles by average event quality, with post hoc comparisons of pairs of means with the least significant difference test.
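The aggregation and stratification step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' procedure: the function names and toy records are hypothetical, and the quartile cut points use the default method of Python's `statistics.quantiles`, which may differ from the statistical software used in the study.

```python
from collections import defaultdict
from statistics import mean, quantiles

def event_means(records):
    """Aggregate (event_id, rating) pairs to one mean rating per event."""
    by_event = defaultdict(list)
    for event_id, rating in records:
        by_event[event_id].append(rating)
    return {event_id: mean(ratings) for event_id, ratings in by_event.items()}

def quality_quartiles(means):
    """Assign each event a quality stratum: 1 = lowest-rated, 4 = highest-rated."""
    q1, q2, q3 = quantiles(means.values(), n=4)  # three cut points
    strata = {}
    for event_id, m in means.items():
        if m <= q1:
            strata[event_id] = 1
        elif m <= q2:
            strata[event_id] = 2
        elif m <= q3:
            strata[event_id] = 3
        else:
            strata[event_id] = 4
    return strata

# Toy data (hypothetical): four events, two ratings each on the 1-5 scale.
records = [("A", 1), ("A", 3), ("B", 3), ("B", 3),
           ("C", 4), ("C", 4), ("D", 5), ("D", 5)]
means = event_means(records)
strata = quality_quartiles(means)
print(means)
print(strata)
```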
For all events (N = 71,472), there were differences in evaluation means (P < .001) related to elapsed weeks. Significant differences were observed between all pairs of elapsed week category means (all P < .001); however, the average effect size was small (0.06). Means for W1 to W4 were 3.81, 3.78, 3.74, and 3.88, respectively. Standard deviations for the elapsed week categories were 0.93, 0.92, 0.93, and 0.88 for W1 to W4, respectively. Standard deviations were smaller for the W4 category than for the W1 to W3 categories (all P < .001) (Table 1). Controlling for year of learner or event type did not change the results substantially. Regression results were consistent with the ANOVA results; the number of days elapsed between when an event occurred and when it was evaluated was not a sizable predictor (β = .003, R² = 0.004).
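To give a sense of scale for these effect sizes, the sketch below computes a standard pooled-standard-deviation Cohen's d from the W1 and W4 means and standard deviations reported above. It assumes equal group sizes because the per-category counts are not given in the text, and it does not apply the large-N adjustment mentioned in the methods, so the value is only a rough illustration. The function name is hypothetical.

```python
from math import sqrt

def cohens_d_equal_n(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d with a pooled SD, assuming equal group sizes (an assumption)."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd

# W4 versus W1, using the means and SDs reported in the text.
d_w4_vs_w1 = cohens_d_equal_n(3.88, 0.88, 3.81, 0.93)
print(f"d (W4 vs. W1) = {d_w4_vs_w1:.2f}")
```

Under these assumptions the result is roughly 0.08, which is of the same small order as the average effect size reported in the text.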
When data were aggregated to the level of the lecture/lab/small-group event (N = 531), contrary to our hypothesis, evaluation means were significantly different among elapsed time categories (P < .001). The means were 3.74, 3.77, 3.77, and 3.89 for W1 to W4. Contrary to expectations, the W4 mean was different from all others (all P < .001, average effect size = 0.32). However, supporting our hypothesis, variation significantly decreased when evaluations were submitted after more elapsed weeks (P ≤ .001). Standard deviations by elapsed week W1 to W4 were 0.85, 0.82, 0.77, and 0.78, with significant differences between W1 and W2 when compared with W3 and W4 (all P < .002).
Results for the event data (N = 531) stratified by event quality quartiles are shown in Figure 1. Our expectation for more similarity among means in the highest and lowest quality strata than those in the midquality strata was not supported (P ≤ .05 for each quartile). Means were significantly different (P < .05) between W1/W2 and W3/W4 in the lowest quality stratum, between W1/W2/W3 and W4 in the middle two quality strata, and between W1/W3 and W2/W4 in the highest quality stratum. Contrary to our expectation of seeing more stability in variances within events in the highest and lowest quality strata, there were significant differences in variability within each quality stratum (P < .05 for each stratum). The overall trend was for significantly lower variability in W4 than in W1 (P < .05 for each stratum).
Most of what we know about the psychometrics of course evaluations is based on paper-and-pencil delivery methods administered at a single point in time.2–4 Increasing numbers of medical schools are adopting Web-based evaluation systems.5 Among their benefits7–9 is the opportunity to distribute evaluations closer in time to when lectures/labs occur. The novel contribution of this study is that evaluations are available to students on the day the event actually occurred. This is quite different from how evaluations are often administered, but it is increasingly feasible with the proliferation of Web-based evaluation systems. We asked whether the number of weeks elapsing between when a teaching event occurred and its evaluation was related to the evaluation outcome, and our expectations were guided by primacy and recency theories.10 Overall, we expected to find stability in means but a reduction in variability with an increasing number of elapsed weeks. However, we found that means were often higher with more elapsed weeks (though effect sizes were mostly small); sometimes, variability was lower. The same pattern generally held true within the highest- and lowest-rated events.
These results suggest that the time between when an event occurs and when it is evaluated only negligibly impacts the final outcome. The small effect sizes even in the face of statistical significance support the validity of student ratings—students tend to agree on what is and is not a quality teaching event.12
This study has several limitations. The data are from a single school’s evaluation system that requires students to complete evaluations to receive exam grades. We have not included peer evaluations or other evaluation outcomes to support the validity of the ratings. However, earlier work with these and similar data supports the validity inferences.3 Moreover, the main variable of interest was a five-point scale with ratings skewed to the left, limiting the sensitivity. Only one gradient was labeled as poor on the basis of observations that students are reluctant to call anything poor (and faculty are even more reluctant to have ratings with this terminology on their teaching record). Choice of a scale with more gradients would surely increase variability; it may or may not change the main findings.
One other issue that deserves reiteration is that these data were collected in a system that requires students to do their evaluations before they receive their grades. In most cases, that is about three to four weeks after a course ends. Thus, a culture exists in which many students complete their evaluations in a timely manner, and we have very few students in the W4 period. This may reduce generalizability to other schools that use a more voluntary system, as well as those that give evaluations on the last day of a course.
Overall, even though some analyses reached statistical significance, these results suggest that timing of evaluation does not substantially matter, at least within a few weeks of the event. This should be reassuring to faculty who increasingly rely on student ratings of their teaching in the reappointment and promotion process and worry about the fairness and accuracy of such data: students who wait several weeks to evaluate a lecture see it similarly to those who registered their thoughts early. Future research should focus on how student leniency/stringency is related to timing. Within any evaluator group, there are consistently low and high raters.12,13 These so-called “hawks” and “doves” tend to balance out each other’s extreme ratings overall.14 It would also be useful to think about how one might design alternative evaluation systems, perhaps using matrix sampling, that are less labor intensive for students.15 However, evaluating a sample of events may be difficult for students if they evaluate comparatively as suggested by primacy and recency theories.10 Future studies examining other Web-based evaluation features (e.g., format, anonymity) will continue to provide information regarding differences that might exist between traditional and electronic evaluations.
1 Kogan JR, Shea JA. Course evaluation in medical education. Teach Teach Educ. 2007;23:251–264.
2 Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med. 2005;20:1159–1164.
3 Greenwald A. Validity concerns and usefulness of student ratings of instruction. Am Psychol. 1997;52:1182–1186.
4 Shores JH, Clearfield M, Alexander J. An index of students’ satisfaction with instruction. Acad Med. 2000;75(10 suppl):S106–S108.
5 Marsh HW, Roche LA. Making students’ evaluations of teaching effectiveness effective. Am Psychol. 1997;52:1187–1197.
6 Shea JA, Bellini LM. Evaluations of clinical faculty: The impact of level of learner and time of year. Teach Learn Med. 2002;14:87–91.
7 Leung DYP, Kember D. Comparability of data gathered from paper and Web surveys. Res High Educ. 2005;46:571–591.
8 Rosenberg ME, Watson K, Miller W, Harris I, Valdivia TD. Development and implementation of a Web-based evaluation system for an internal medicine residency program. Acad Med. 2001;76:92–95.
9 D’Cunha J, Larson CE, Maddaus MA, Landis GH. An Internet-based evaluation system for a surgical residency program. J Am Coll Surg. 2003;196:905–910.
10 Leventhal L, Turcotte SJC, Abrami PC, Perry RP. Primacy/recency effects in student ratings of instruction: A reinterpretation of gain–loss effects. J Educ Psychol. 1983;75:692–704.
11 Turcotte SJC, Leventhal L. Gain–loss versus reinforcement–affect ordering of student ratings of teaching: Effect of rating instructions. J Educ Psychol. 1984;76:782–791.
12 Kreiter CD, Ferguson KJ. The empirical validity of straight-line responses on a clinical evaluation form. Acad Med. 2002;77:414–418.
13 Hoyt WT. Rater bias in psychological research: When is it a problem and what can we do about it? Psychol Methods. 2000;5:64–86.
14 Albanese MA. Challenges in using rater judgments in medical education. J Eval Clin Pract. 2000;6:305–319.
15 Kreiter CD, Lakshman V. Investigating the use of sampling for maximizing the efficiency of student-generated faculty teaching evaluations. Med Educ. 2005;39:171–175.