Pugnaire, Michele P.; Purwono, Urip; Zanetti, Mary L.; Carlin, Michele M.
Student evaluation ratings of teaching are widely regarded in higher education as the single most valid source of data on teaching effectiveness.1 The measurement of students’ perceptions and their satisfaction with their educational experiences is particularly important for medical educators in the preclinical as well as the clinical years for a number of reasons. In the preclinical years, for example, medical school courses differ substantially from other higher education curricula in that they tend to rely on multiple lecturers, each of whom may deliver only one lecture.2 In the clerkship years, the diversity of ward-based educational experiences and the anticipated variance among preceptors compel clinical educators to monitor programmatic quality closely using student ratings of the teaching experiences provided in the clinical setting.3 Reinforcing this approach, medical education research supports the use of subjective student ratings as a measure of teaching effectiveness and quality.4 Thus, while diverse methods are often used to assess the quality of medical education, student ratings of the teaching experiences provided to them remain the foundational element of programmatic assessment for medical school curricula.
The AAMC Graduation Questionnaire (GQ) is one example of a student rating survey that measures educational quality using students’ ratings of their undergraduate medical education experiences, including preclinical and clerkship teaching. Under the auspices of the AAMC, the GQ is administered to graduating medical students at the end of Year 4. Since it was first implemented in 1978, the GQ has been widely used to monitor curriculum quality across all four years and to measure outcomes of curriculum change, both within individual medical schools and nationally across medical schools.
In addition to the GQ, individual medical schools also commonly administer their own “internal” written student surveys that provide student evaluation ratings contemporaneously within a given course or period of study. These surveys are typically completed at the end of a particular academic unit/block, course, clerkship, semester, and/or year. Most commonly, they are administered upon completion of an individual course or clerkship, so-called end-of-course or end-of-clerkship (EOC) evaluations. In addition, follow-up “alumni” surveys may also be administered to students beyond the completion of medical school, often at the end of the first postgraduate year. These so-called PGY1 surveys are used to measure changes in student ratings over time, as students progress through their postgraduate medical training.
While these types of surveys (EOC, GQ, and PGY1) provide useful measures of student perceptions at one point in time, it is not known how, or how much, students’ ratings of their medical education experience change across this time frame. In this study, the researchers conducted a comparative analysis of student rating outcomes from three different programmatic evaluation surveys administered at their institution at three different points in time. The three surveys were (1) the EOC evaluations, conducted after each required clerkship in Year 3 and assessing overall clerkship experience and quality including, but not limited to, curriculum, faculty teaching, and materials; (2) the AAMC GQ, conducted at the end of Year 4; and (3) the postgraduation survey (PGY1), administered to medical school alumni at the end of their internship year. The stability of student ratings across the six required clerkships was assessed by comparing ratings of each clerkship as measured in Year 3 EOC evaluations with ratings for the same six clerkships obtained 12–24 months later on the GQ. The stability of students’ ratings of their overall four-year medical school curriculum was assessed by comparing satisfaction ratings for selected items from the GQ with satisfaction ratings for the same items obtained one year later in the PGY1 survey administered upon completion of internship.
With approval from the IRB of the University of Massachusetts Medical School (UMMS), data from two cohorts of graduating students were examined in the study: the graduating class of 2000 (n = 100) and the graduating class of 2001 (n = 93). For each cohort, relevant student ratings from each of the six required EOC evaluations, the AAMC GQ, and the PGY1 survey were extracted and compiled. Students’ ratings of six third-year clerkships were examined (family medicine, medicine, obstetrics–gynecology, pediatrics, psychiatry, and surgery). For each of the clerkships, both EOC and GQ surveys assessed the same item: the “overall quality” of the clerkship's educational experience using a similar Likert rating scale.
Because the analysis required “matched” samples, this study excluded all data from those students who declined the release of their personally identified responses. However, as discussed in the conclusion section, this exclusion did not appear to have a substantial impact on the overall results of the study.
The PGY1 survey consisted of a three-page scantron survey mailed to all UMMS alumni one year after graduation. A lottery-based gift certificate was offered as an incentive to return surveys. The PGY1 survey included selected items extracted from the GQ administered the previous year and adapted to the PGY1 survey. For the classes of 2000 and 2001, the common items on both the GQ and PGY1 surveys were:
* Do you believe that the time devoted to your instruction was inadequate, appropriate, or excessive?
* Overall I am satisfied with the quality of my medical education; and
* I am confident that I have acquired the skills required to begin a residency.
On the PGY1 survey, these three items were configured so as to be comparable to the GQ survey and the same Likert-type scale was used. The PGY1 survey also assessed “adequacy of instructional time” for 29 out of 49 items noted on the GQ and these 29 items were reproduced verbatim in the PGY1 survey for the classes of 2000 and 2001.
Using matched data from the 2000 and 2001 cohorts, ratings from the EOC and GQ evaluations for each of the six clerkships were compared. Preliminary inspection of the data showed that the distribution of the responses was negatively skewed. On average, more than 80% of the participants responded either “good” or “excellent” to questions pertaining to their clerkship experience in both the GQ and EOC questionnaires, leaving only a small percentage of responses in the less favorable categories. This trend was consistently observed across all three surveys. Given this skewed distribution of responses, the rating categories were combined by recoding all ratings other than “excellent” (i.e., “good,” “fair,” and “poor”) into “other,” yielding a dichotomous rating scale of “excellent” and “other.” The comparison was performed by examining the difference in the percentage of “excellent” ratings between the two surveys using a normal approximation to the binomial distribution. This bivariate statistical analysis tested the null hypothesis that the difference in the proportion of “excellent” ratings between the EOC and the GQ for each of the six clerkships equals zero.
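Stated generally (this formula is a gloss on the normal approximation the authors describe, not reproduced from the article), the statistic for comparing the two proportions of “excellent” ratings is:

```latex
z = \frac{\hat{p}_{\mathrm{EOC}} - \hat{p}_{\mathrm{GQ}}}
         {\sqrt{\bar{p}\,(1 - \bar{p})\left(\dfrac{1}{n_{\mathrm{EOC}}} + \dfrac{1}{n_{\mathrm{GQ}}}\right)}},
\qquad
\bar{p} = \frac{x_{\mathrm{EOC}} + x_{\mathrm{GQ}}}{n_{\mathrm{EOC}} + n_{\mathrm{GQ}}}
```

where \(x\) is the number of “excellent” ratings, \(n\) the number of matched respondents, and \(\bar{p}\) the pooled proportion under the null hypothesis. Under that hypothesis, \(z\) is approximately standard normal, so \(|z| \geq 1.96\) indicates significance at \(p < .05\).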
Using the same methodology, a subsequent analysis compared GQ and PGY1 ratings. As a first step, the “inadequate” and “excessive” ratings in response to the question on “adequacy of instructional time” were combined into a single category of “other,” yielding a dichotomous rating scale of “appropriate” and “other.” After calculating the percentage of “appropriate” ratings from the GQ and PGY1 surveys, the two proportions were compared using the previously outlined statistical procedure. The remaining two questions, pertaining to “satisfaction with my medical education” and “preparedness to begin a residency program,” were analyzed using the same methodology.
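As an illustrative sketch only (the counts below are hypothetical and not drawn from the study's data), the two-proportion comparison described above can be computed as:

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Two-proportion z-test using the pooled normal approximation
    to the binomial distribution (illustrative sketch)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se  # z statistic

# Hypothetical matched sample: 40 of 70 "excellent" EOC ratings vs 36 of 70 on the GQ
z = two_proportion_z(40, 70, 36, 70)
print(round(abs(z), 2))  # → 0.68, well below 1.96, so not significant at p < .05
```

A dedicated routine such as `proportions_ztest` in the statsmodels package would give the same result; the hand-rolled version is shown only to make the pooled-variance calculation explicit.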
The class size for the respective cohorts was 100 (class of 2000) and 93 (class of 2001) and the response rates for all three surveys were generally high (100% for the GQ, 86–100% for the EOC, and 61–66% for the PGY1). The matched responses for the comparison groups were as follows: EOC versus GQ: n = 66–77 (class of 2000) and n = 67–71 (class of 2001); GQ versus PGY1: n = 58 (class of 2000) and n = 47 (class of 2001).
When comparing the EOC and GQ ratings for each clerkship, the statistical analysis indicated that the proportions of “excellent” ratings were consistent (see Table 1), with no significant differences between the EOC and GQ across all six clerkships (z values ranged from 0.18 to 1.94). This pattern remained stable and reproducible for both the classes of 2000 and 2001.
Analysis of the GQ and PGY1 surveys for the ratings of the “adequacy of instructional time” items also revealed notable consistency across the GQ and PGY1 with significant differences in only one (Geriatrics) of the 29 items for the class of 2000 (z = 1.98; p < .05) and in five (Long Term Health Care, Clinical Pharmacology, Risk Assessment and Counseling, Law and Medicine, Family Dynamics) of the 29 items for the class of 2001 (z = 2.04–2.76; p < .05). Thus for the large majority of these 29 items, the rating pattern was consistent from the GQ to PGY1 survey and that pattern was reproducible across both classes of 2000 and 2001 (see Table 2).
When analyzing the GQ and PGY1 survey ratings for “preparedness for residency” and “satisfaction with medical education” items, the responses were greatly skewed, with the preponderance of ratings in the “Strongly Agree” or “Agree” category. As with the prior comparisons, overall students’ satisfaction with medical education (combined category ratings “Strongly Agree” and “Agree”) was also consistent over time, with no significant differences across the GQ and PGY1 ratings for both cohorts. By contrast, the percentage of “Strongly Agree” ratings for “residency preparedness” was significantly greater in the PGY1 survey compared to the GQ, and this was reproducible for both cohorts. As shown in Table 2, both cohorts demonstrated a sizable increase in the percentage of “Strongly Agree” ratings in the PGY1 survey, with a gain of 21% in 2000 and 32% in 2001 (z = 2.26–3.11; p < .05).
The comparison of ratings from the EOC and GQ surveys demonstrated that students’ overall ratings of third-year clerkships were relatively stable over time and that this pattern of rating stability was reproducible across two consecutive class cohorts. Similarly, this consistency in student ratings over time was also observed across the GQ and PGY1 surveys. As with the clerkship ratings, this pattern was reproducible across both cohorts, with the exception of the ratings for “I am confident that I have acquired the skills required to begin a residency.” For this particular item, there was a large and statistically significant shift toward more favorable ratings after the internship year, and the magnitude of this shift was comparable across the two cohorts. The reproducibility of this favorable shift across cohorts suggested a general consistency in the impact of the internship year on student perception. Possible interpretations were that students undervalued the skill set acquired in medical school, that they overrated the anticipated challenges of residency, or that they were underconfident in their own readiness. Irrespective of the interpretation, these data suggested that students’ perceived confidence in their skills was more favorable after the internship than at graduation.
It is also possible that the differences that were observed may have resulted from other factors, including curricular changes occurring within the institution during the period of the study. We did not choose to further investigate curricular changes or other causal factors that may have explained differences in student perception, as this was not the primary focus of the study. Furthermore, it should be noted that for those items that showed significant change, the general trend for both cohorts was towards more favorable ratings after internship as compared to the end of medical school. This suggests that despite any changes in curriculum that may have occurred during this time, the trend of the changes was generally predictable, in favor of more positive perceptions after PGY1.
With these notable exceptions, these outcomes suggested that for the studied medical school, GQ ratings were, in general, reflective of student perceptions of educational experiences as measured in surveys administered one year before and one year after graduation. The study was limited by the exclusion of ratings from those students who declined to release their personal identifying data from the AAMC. For the cohorts examined, 17% in the class of 2000 and 14% in the class of 2001 were excluded from the study. As reported by Hodgson et al.,5 the ratings of those students who declined to release their data disproportionately represented the less “favorable” responses to the GQ items. While their finding could preferentially skew the results of this study towards more favorable ratings, this would not necessarily bias the stability of the ratings over time for those students who elected to release their data. Furthermore, for those students releasing their data, the reproducibility of a stable pattern of GQ ratings was clearly demonstrated across two separate cohorts. This suggests that irrespective of those students declining to release their identifying information, the GQ ratings appeared to be remarkably stable over time for the majority of items examined, with the previously noted exceptions. It is also notable that for these items, the shifts observed over time were consistently favorable and generally reproducible across two cohorts.
These findings benefited the studied medical school by expanding the applicability of the GQ in assessing selected aspects of the educational program, both retrospectively and prospectively, while utilizing rating discrepancies in the specific areas as valuable feedback for curricular improvement. It is not known if the stability of GQ ratings applies more broadly to other GQ items or if other medical schools would demonstrate comparable trends for their GQ outcomes.
1. Nelson MS. Peer evaluation of teaching: an approach whose time has come. Acad Med. 1998;73:4–5.
2. Leamon MH, Servis ME, Canning RD, Searles RC. A comparison of student evaluations and faculty peer evaluations of faculty lectures. Acad Med. 1999;74:22–4.
3. Mazor K, Clauser B, Cohen A, Alper E, Pugnaire M. The dependability of students’ ratings of preceptors. Acad Med. 1999;74:19–21.
4. Griffith CH, Wilson JH, Haist SA, Ramsbottom-Lucier M. Relationships of how well attending physicians teach to their students’ performances and residency choices. Acad Med. 1997;72:118–20.
5. Hodgson CS, Teherani A, Guiton G, Wilkerson L. The relationship between student anonymity and responses from two medical schools on the Association of American Medical Colleges’ Graduation Questionnaire. Acad Med. 2002;77(10 suppl):S48–50.