The need to improve patient safety in health care is widely recognized.1 To ensure a generation of quality-and-safety-conscious doctors, the Accreditation Council for Graduate Medical Education requires residents to demonstrate competency in practice-based learning and improvement (PBLI) and systems-based practice (SBP).2 Consequently, many residency programs have implemented a variety of quality improvement (QI) curricula.3–6 Although reflection on practice is an essential component of QI,7–10 research about physicians' reflections on practice improvement is limited.
Reflection is a key feature of QI7–10 and professionalism,11–13 and reflecting on personal practice is a prerequisite to identifying how QI methodologies can effect change.5,8,10,14 We used Sandars'15 definition of critical reflection in our research: “a meta-cognitive process that occurs before, during, and after situations with the purpose of developing greater understanding of both the self and the situation, so that future encounters with the situation are informed from the previous encounters.” The importance of reflection in QI can be understood through transformative learning theory,16 which explains how “activating events” (i.e., disorienting dilemmas) expose personal limitations, kindle reflections, and reveal one's falsely held assumptions. This transformative process involves testing fresh perspectives to develop new knowledge, attitudes, or behaviors. For example, a practicing physician may reflect on an adverse event, thus identifying ways to improve personal practice (PBLI) or the health care system (SBP). Ultimately, research on reflection will be necessary for promoting personal and systems improvement among physicians.
Researchers have previously validated instruments to assess levels of reflection among medical students and residents.17,18 For example, the Mayo Evaluation of Reflection on Improvement Tool (MERIT) was designed to measure the quality of residents' reflections on adverse events.18 MERIT has high internal consistency and interrater reliability and comprises the dimensions of personal reflection, systems reflection, and event merit.18 In this study, we (1) investigated the temporal stability of internal medicine residents' MERIT reflection scores across three years of training, (2) examined potential differences between MERIT subscale scores, and (3) determined associations between MERIT reflection scores and characteristics of residents and of adverse events. We hypothesized that residents' reflections on adverse events would improve over time and that personal reflection scores would be higher than systems reflection scores.
Study design and population
This was a three-year (2006–2009) longitudinal study of 48 categorical internal medicine residents graduating in 2009 from Mayo Clinic Rochester. As a part of their PBLI curriculum, all Mayo Clinic residents complete biannual written reflections on adverse events encountered in their personal practices. Residents are prompted to describe events from practice in which care was suboptimal, to reflect on the events from a personal and systems perspective, and then to rate the event's severity (near miss, minor impact, moderately severe impact, severe impact, or patient death) and preventability (yes or no). Each resident in the 2009 graduating class was asked to complete six reflections during their three years of training. This study was deemed exempt by the Mayo institutional review board.
MERIT is a previously validated instrument composed of 18 items structured on one or the other of two four-point scales depending on the item (No, Somewhat, Almost, Yes or Bottom, Second, Third, Top quartile).18 Accordingly, higher MERIT item scores would indicate a deeper level of reflection on the adverse event described by the resident. Faculty members who participated in the original MERIT instrument development read and scored residents' written reflections using MERIT. Content validity for MERIT was based on national guidelines and iterative revisions by QI experts.
Regarding internal structure validity, the MERIT instrument was shown by factor analysis to discriminate between three dimensions: personal reflection, systems reflection, and event merit.18 By definition, reflection on personal practice is a written reflection in which residents consider their personal contributions to the event, question their own practices, and identify novel behaviors to positively affect patient safety.18 Reflection on the system is a written reflection in which residents consider institutional or system contributions to the event, question current institutional standards, and identify next steps for systems improvements.18 Reflection on problem of merit is a written resident reflection that is comprehensible and stands alone by its descriptions, is patient-centered, has relevance to other patients, and has potentially serious adverse impacts.18 In previous studies, the interrater reliability (intraclass correlation coefficient [ICC] range 0.73–0.89) and internal consistency reliability (Cronbach alpha range 0.83–0.93) of MERIT scores were good to excellent.18
Data collection and analysis
Two faculty investigators (C.M.W., M.M.D.), who were blind to the residents' identities, used MERIT to assess written reflections completed by the residents in the current study. These faculty ratings were averaged to generate overall and individual factor MERIT reflection scores (18 items on four-point scales). To show agreement on MERIT scores between the faculty investigators, we determined interrater reliability by calculating the overall ICC for two raters, averaged across all the items and completed reflections. Other study variables were training level (postgraduate year), gender (male or female), event preventability (yes or no), and event severity (near miss to death). Measures of event severity and preventability were based on the residents' determinations of the patient events as recorded in their written reflections. To determine the temporal stability of MERIT scores, repeated-measures ANOVA tests for no time effect were used to assess changes in MERIT scores across the three years of training. A repeated-measures ANOVA test of no factor effect was used to establish whether any MERIT factor mean scores differed significantly from the others. Additionally, paired t tests were used to look for differences between each of the three MERIT factor mean scores.
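The interrater reliability calculation described above can be illustrated with a minimal sketch. The text does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, and the function name and the scores are hypothetical placeholders rather than study data.

```python
def icc_two_way_random(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: one row per written reflection, one column per rater.
    The choice of ICC form is an assumption; the study does not specify it.
    """
    n = len(ratings)          # number of rated reflections
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for reflections (rows), raters (columns), and residual
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical mean MERIT scores from two raters for five reflections
scores = [[2.5, 2.75], [3.0, 3.0], [2.0, 2.5], [3.5, 3.25], [2.25, 2.0]]
print(f"ICC(2,1) = {icc_two_way_random(scores):.2f}")
```

Other ICC forms (for example, consistency rather than absolute agreement, or the average of two raters) would give different values from the same ratings, so the form should be reported alongside the coefficient.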
To seek out associations between MERIT scores and other meaningful variables, data arising from this repeated-measures design were analyzed using generalized estimating equations. These models evaluated associations between residents' individual factor and overall MERIT scores and both learner characteristics (level of training and gender) and adverse event characteristics (preventability and severity). Statistical significance was set at P < .01 to account for multiple comparisons. A sample size calculation indicated >90% power to detect a meaningful 0.5-point difference between MERIT reflection scores. Statistical analyses were conducted using SAS version 9.1 (SAS Institute Inc., Cary, North Carolina).
Reflection score comparisons and temporal stability
The number of residents who completed at least one biannual reflection was 47 (97.9%) in postgraduate year (PGY) 1, was 45 (93.8%) in PGY 2, and was 38 (79.2%) in PGY 3. Overall, the residents completed 240 of 288 (83.3%) possible reflections. The individual factor and overall MERIT mean reflection scores during each year of training are shown in Table 1. The overall ICC was determined to be 0.80, which represents very good interrater reliability. Repeated-measures ANOVA tests revealed that there were no significant changes in the MERIT individual factor or overall factor mean scores across three years of training (37 residents completed reflections during all three years; all P values >.01; see Table 1), thus supporting the temporal stability of MERIT scores across three years of training.
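The completion figures reported above follow from the cohort size of 48 and the six reflections requested of each resident. A brief arithmetic sketch, using only the counts given in the text (the helper name `percent` is illustrative), is:

```python
COHORT_SIZE = 48               # residents in the 2009 graduating class
REFLECTIONS_PER_RESIDENT = 6   # biannual reflections over three years


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Residents who completed at least one reflection in each training year
residents_with_reflection = {"PGY 1": 47, "PGY 2": 45, "PGY 3": 38}
for year, count in residents_with_reflection.items():
    print(f"{year}: {count}/{COHORT_SIZE} = {percent(count, COHORT_SIZE):.1f}%")

possible = COHORT_SIZE * REFLECTIONS_PER_RESIDENT  # 288 possible reflections
completed = 240
print(f"Overall: {completed}/{possible} = {percent(completed, possible):.1f}%")
```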
Mean (SD) MERIT factor scores averaged across three years of training were as follows: Factor 1 (personal reflection) = 2.42 (0.41), Factor 2 (systems reflection) = 2.08 (0.45), and Factor 3 (event merit) = 3.63 (0.18) (see Table 1). Repeated-measures ANOVA tests revealed that at least one factor mean score differed from the others (P < .001). Paired t tests were used to examine the differences between each of the three MERIT factor mean scores. The mean (SD) difference between Factor 1 and 2 means was 0.34 (0.37), P < .0001; between Factor 3 and Factor 1 means, 1.21 (0.35), P < .0001; and between Factor 3 and Factor 2 means, 1.55 (0.40), P < .0001. These findings reveal statistically significant differences between all factor means, with Factor 3 (event merit) being the highest and Factor 2 (reflection on systems) the lowest, thus indicating that residents may be least skilled in reflecting on systems.
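The paired comparisons above rest on the within-resident differences between factor scores: the t statistic is the mean difference divided by its standard error. A minimal sketch of that calculation follows. The function name and the example scores are hypothetical and are not study data.

```python
import math
import statistics


def paired_t(first, second):
    """Paired t statistic and degrees of freedom for two matched score lists."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample SD of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical per-resident mean scores (illustration only)
personal = [2.4, 2.9, 2.1, 2.6, 2.3]
systems = [2.0, 2.3, 2.0, 2.1, 2.2]
t_stat, df = paired_t(personal, systems)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

Because the test uses the standard deviation of the differences rather than of each factor separately, a modest mean difference can still yield a large t statistic when residents' two scores move together.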
Reflection score associations
Generalized estimating equations indicated that event preventability was associated with MERIT individual factor mean scores (all P values ≤.01) and overall mean scores (beta = 0.415; 95% CI = 0.186–0.643; P = .0004) (see Table 2). This indicates that the MERIT overall mean score is 0.415 scale points higher for reflection on a preventable event than for reflection on a nonpreventable event. Notably, this substantial 0.415-scale-point difference equals approximately one standard deviation of the MERIT overall mean score. There were no statistically significant associations between MERIT scores and resident gender, time when the reflection occurred, or event severity.
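As a consistency check on the reported preventability estimate, the standard error and P value can be approximately recovered from the coefficient and its confidence interval. The sketch below assumes a symmetric Wald-type 95% interval and a normal reference distribution, neither of which is stated in the text, and the function name is illustrative.

```python
import math


def wald_check(beta, ci_low, ci_high, z_crit=1.959964):
    """Recover SE, z, and two-sided P from a coefficient and its 95% CI.

    Assumes a symmetric Wald interval (beta +/- z_crit * SE) and a normal
    reference distribution; both are assumptions, not reported details.
    """
    se = (ci_high - ci_low) / (2 * z_crit)
    z = beta / se
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p_two_sided


se, z, p = wald_check(beta=0.415, ci_low=0.186, ci_high=0.643)
print(f"SE = {se:.3f}, z = {z:.2f}, two-sided P = {p:.4f}")
```

Under these assumptions the recovered P value falls close to the reported P = .0004, which suggests that the coefficient, interval, and P value are mutually consistent.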
We are unaware of previous studies on the temporal stability of residents' reflection scores and associations between these scores and variables related to learners and adverse patient events. This study demonstrates that residents' reflections on improvement opportunities encountered in practice were stable across three years of training, were lower for systems reflection than for personal reflection, and were associated with the preventability of adverse patient events.
This study also adds to the existing validity evidence for MERIT. A prevailing model for validity in education research states that construct validity is supported by evidence from the categories of content, response process, internal structure, criteria, and consequences.19–24 Previous studies demonstrated MERIT content validity based on items that were derived from the literature and expert input, and these same studies demonstrated MERIT internal structure validity based on items clustering into three dimensions and having good internal structure and interrater reliability.18 MERIT score stability over time may provide additional evidence of the internal structure validity19,20,25 because it suggests that reflectiveness is an inherent, unchanging trait and indicates that MERIT reflection scores should be dependable when used in future research at various points in time. Furthermore, temporal stability seems to be uncommonly reported among educational assessments.24 Nonetheless, the finding of reflection score temporal stability could also indicate that resident reflection did not improve over time, whereas improvement might have been expected had the residents received adequate training on how to reflect. This study's major contribution to validity is the finding of an association between reflection scores and adverse event preventability (criterion validity).
Previous research revealed that MERIT systems reflection scores were lower than personal reflection or event merit scores, but the sample size was likely too small to demonstrate a significant difference.18 The current, larger study has shown that systems reflection scores are significantly lower than personal reflection scores. One explanation for this finding is that residents may feel compelled to report adverse events that affected them personally (and, hence, emotionally). This possible explanation is supported by previous research showing relationships between emotion and memory.26–28 An alternative explanation, however, is that our curriculum does not adequately emphasize the importance of systems factors in adverse events. Indeed, a prevailing concept in QI instruction is that systems contributions should not be overlooked.3,6,10,29 Therefore, this study indicates the need for further research on why residents may reflect more on personal than systems aspects of adverse patient outcomes.
In addition, there was no relationship between MERIT reflection scores and gender or years of training. This finding upholds previous research on reflection with medical students, which also showed no associations between reflection and gender or years of training.17 However, our study is unique because it involves reflection on QI among residents.
Also, in this sample, reflection scores were strongly associated with the preventability of adverse patient events. The literature asserts that most medical errors are preventable30–32 and that patients desire full disclosure of medical errors to prevent future recurrences of similar problems.33 Despite this, physicians have been found to “choose their words carefully” when disclosing medical errors, to reduce the risk of litigation.33 Our finding of an association between reflection scores and preventable events may suggest that physicians are especially anxious about missing problems that they could have averted. Our findings also extend previous studies showing that the situation or context impacts the degree of reflection.17,34
Although this study is unique to residents, we recognize the existence of previous research on reflection among other participants, including nursing students,35 medical students,17 and practicing doctors.36 Notably, these reflection studies did not pertain to reflection on QI. We also acknowledge that an instrument already exists for measuring levels of reflection ranging from habitual action to critical reflection.37 Our study builds on this research by measuring residents' reflections on QI opportunities (reflective thinking), which, in turn, could evolve into meaningful QI initiatives (transformed behaviors).
Finally, this study has limitations. First, it involves a single institution, which may limit generalizability of the findings. Second, it is based on a modest-sized sample; however, a power calculation indicated that any meaningful differences in MERIT reflection scores should have been detected. Third, MERIT requires trained faculty raters to read and score residents' written reflections, which may be less meaningful than having residents rate their own degree of reflectiveness. Fourth, although we are not aware of any differences between residents with complete versus incomplete reflection data, it remains possible that such differences could have influenced the study findings. Finally, this study did not measure other variables that may be associated with residents' reflection scores, such as medical knowledge or resident well-being.
We found that residents' MERIT reflection scores are stable across three years of training, that the quality of systems reflections is lower than that of personal reflections, and that reflection scores are associated with the preventability of adverse patient events. These findings provide further evidence for the validity of MERIT scores, indicate that residents may feel compelled to report adverse events that affected them personally, suggest that the preventability of adverse patient outcomes has a strong influence on residents' reflectiveness, and expand on previous studies showing that context (e.g., event preventability) impacts the degree of reflection. Future research should examine other variables associated with residents' reflection scores, such as medical knowledge or resident well-being. Future research also should determine effective ways to enhance learners' reflectiveness on the systems aspects of QI and should elucidate further the relationship between QI and event preventability.
This study was supported in part by the Mayo Clinic Internal Medicine Residency Office of Educational Innovations as part of the Accreditation Council for Graduate Medical Education Educational Innovation Project and the Medicine Innovation and Development System from the Mayo Clinic Department of Internal Medicine.
This study was deemed exempt by the Mayo Clinic institutional review board.
These findings were presented at the 2010 Association for Medical Education in Europe meeting in Glasgow, United Kingdom, and at the 2010 Asia Pacific Medical Education Conference in Singapore.
1 Kohn LT, Corrigan JM, Donaldson MS. To Err Is Human: Building a Safer Health System. Washington, DC: National Academy Press; 2000.
3 Boonyasai RT, Windish DM, Chakraborti C, Feldman LS, Rubin HR, Bass EB. Effectiveness of teaching quality improvement to clinicians: A systematic review. JAMA. 2007;298:1023–1037.
5 Ogrinc G, Headrick LA, Morrison LJ, Foster T. Teaching and assessing resident competence in practice-based learning and improvement. J Gen Intern Med. 2004;19:496–500.
7 Hildebrand C, Trowbridge E, Roach MA, Sullivan AG, Broman AT, Vogelman B. Resident self-assessment and self-reflection: University of Wisconsin–Madison five-year study. J Gen Intern Med. 2009;24:361–365.
8 Oyler J, Vinci L, Arora V, Johnson J. Teaching internal medicine residents quality improvement techniques using the ABIM's practice improvement modules. J Gen Intern Med. 2008;23:927–930.
9 Lagerlov P, Loeb M, Andrew M, Hjortdahl P. Improving doctors' prescribing behaviour through reflection on guidelines and prescription feedback: A randomised controlled study. Qual Health Care. 2000;9:159–165.
11 Lachman N, Pawlina W. Integrating professionalism in early medical education: The theory and application of reflective practice in the anatomy curriculum. Clin Anat. 2006;19:456–460.
12 Stern DT, Frohna AZ, Gruppen LD. The prediction of professional behaviour. Med Educ. 2005;39:75–82.
13 Epstein RM, Hundert EM. Defining and assessing professional competence. JAMA. 2002;287:226–235.
14 Stroebel CK, McDaniel RR, Crabtree BF, Miller WL, Nutting PA, Stange KC. How complexity science can inform a reflective process for improvement in primary care practices. Jt Comm J Qual Patient Saf. 2005;31:438–446.
15 Sandars J. The use of reflection in medical education: AMEE Guide No. 44. Med Teach. 2009;31:685–695.
16 Mezirow J. Transformative learning: Theory to practice. In: Cranton P, ed. New Directions for Adult and Continuing Education. Vol 74. San Francisco, Calif: Jossey-Bass; 1997:5–12.
17 Sobral DT. Medical students' mindset for reflective learning: A revalidation study of the reflection-in-learning scale. Adv Health Sci Educ Theory Pract. 2005;10:303–314.
18 Wittich CM, Beckman TJ, Drefahl MM, et al. Validation of a method to measure resident doctors' reflections on quality improvement. Med Educ. 2010;44:248–255.
19 Messick S. Validity. In: Linn RL, ed. Educational Measurement. Phoenix, Ariz: Onyx Press; 1993.
20 Messick S. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol. 1995;50:741–749.
21 American Educational Research Association; American Psychological Association; National Council on Measurement in Education; Joint Committee on Standards for Educational and Psychological Testing (U.S.). Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
22 Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837.
23 Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–e16.
24 Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med. 2005;20:1159–1164.
25 Beckman TJ. Determining the validity and reliability of clinical assessment scores. In: Henderson M, ed. A Textbook for Internal Medicine Education Programs. Washington, DC: Association of Program Directors in Internal Medicine and Association of Specialty Professors; 2007:139–146.
26 Kapucu A, Rotello CM, Ready RE, Seidl KN. Response bias in “remembering” emotional stimuli: A new perspective on age differences. J Exp Psychol Learn Mem Cogn. 2008;34:703–711.
27 Henckens MJ, Hermans EJ, Pu Z, Joëls M, Fernández G. Stressed memories: How acute stress affects memory formation in humans. J Neurosci. 2009;29:10111–10119.
28 Miall DS. Emotion and the self: The context of remembering. Br J Psychol. 1986;77:389–397.
29 Varkey P, Reller MK, Resar RK. Basics of quality improvement in health care. Mayo Clin Proc. 2007;82:735–739.
30 Benavidez OJ, Gauvreau K, Jenkins KJ, Geva T. Diagnostic errors in pediatric echocardiography: Development of taxonomy and identification of risk factors. Circulation. 2008;117:2995–3001.
31 Forster AJ, Murff HJ, Peterson JF, Gandhi TK, Bates DW. The incidence and severity of adverse events affecting patients after discharge from the hospital. Ann Intern Med. 2003;138:161–167.
32 Baker GR, Norton PG, Flintoft V, et al. The Canadian Adverse Events Study: The incidence of adverse events among hospital patients in Canada. CMAJ. 2004;170:1678–1686.
33 Gallagher TH, Waterman AD, Ebers AG, Fraser VJ, Levinson W. Patients' and physicians' attitudes regarding the disclosure of medical errors. JAMA. 2003;289:1001–1007.
34 Boud D, Walker D. Promoting reflection in professional courses: The challenge of context. Stud Higher Educ. 1998;23:191–206.
35 Wong FK, Kember D, Chung LY, Yan L. Assessing the level of student reflection from reflective journals. J Adv Nurs. 1995;22:48–57.
36 Aukes LC, Geertsma J, Cohen-Schotanus J, Zwierstra RP, Slaets JP. The development of a scale to measure personal reflection in medical practice and education. Med Teach. 2007;29:177–182.
37 Kember D, Leung D, Jones A, et al. Development of a questionnaire to measure the level of reflective thinking. Assess Eval Higher Educ. 2000;25:381–395.