Since 2003, the Accreditation Council for Graduate Medical Education1,2 has required practice-based learning and improvement (PBLI) and systems-based practice as two of six core competencies for resident physicians. The American Board of Medical Specialties also requires these two competencies for board certification in all specialties.3 Additionally, the Robert Wood Johnson Foundation4 and the QSEN (Quality and Safety Education in Nursing) Institute5 have each published recommendations and curricula to teach quality improvement (QI) to health professional learners, and in 2012, the Association of American Medical Colleges recommended a set of competencies for faculty educators in QI and patient safety.6 These competencies, recommendations, and guidelines highlight the increasing importance of QI education. They have also generated substantial interest in designing, implementing, and evaluating curricula for QI. Review articles about teaching QI have described medical training programs and curricula, identified common elements across these programs and curricula, and recommended important next steps.7–9 Each review’s authors have underscored current challenges with evaluating learner QI competence and argued for better instruments to assess learner achievement in QI.
Over the last decade, various QI assessment tools have surfaced, each measuring specific components of QI education. For instance, the Quality Improvement Project Assessment Tool assesses the structure, content, and strength of an initial QI proposal.10,11 The Systems Quality Improvement and Assessment Tool evaluates PBLI self-efficacy, knowledge, and application skills in resident learners12 and can help guide PBLI residency curricula.13 Surveys measuring resident self-reported attitudes about PBLI and QI project implementation have proven to be a useful way for educators to measure achievement of curricular objectives.14 The Systems Thinking Scale measures systems thinking in the context of QI, whereas the Team Check-up Tool measures the QI intervention context itself.15,16 The Quality Improvement Knowledge Application Tool (QIKAT), originally described in 2003 and 2004, has been used to assess the results of an internal medicine elective rotation for residents in QI.17,18
The QIKAT consists of three short descriptions of scenarios. Each depicts a system-level quality problem. The respondent is required to read the scenario and supply a free-text response consisting of an aim, a measure, and one focused change for a QI effort that addresses the system-level issue raised in the scenario. The QIKAT thus assesses an individual’s ability to decipher a quality problem within a complex system and propose an initiative for improvement. This capacity, coupled with its straightforward administration and its ability to measure QI knowledge application close to curricular interventions, has resulted in the widespread use of the QIKAT across disciplines and developmental learning stages. It has been used to assess QI learning in medical school curricula19,20; in interprofessional education21; in internal medicine,22 psychiatry,23,24 and family medicine residencies25; and in a preventive medicine fellowship.26 When used by expert scorers who resolve discrepancies and agree on a final score, the QIKAT has demonstrated very good content, construct, and predictive validity.20
Although many have found the QIKAT useful, its scoring system has limited its widespread adoption. The original scoring system assigned a score from 1 to 5 to each scenario; each score was based on a value assessment from the scorer (Box 1). This scoring system is subjective and has inconsistent reliability.17 Some QI educators have proposed amended scoring systems, but these have only moderately improved the reliability across scorers.22 To make the QIKAT more widely usable, we hoped to create a revised, easy-to-use scoring rubric to assess application of QI learning. We report our efforts to develop and assess the validity of the “QIKAT-R” here.
Box 1 The Original Quality Improvement Knowledge Application Tool (QIKAT) Scoring System
When scoring, please consider the following factors:
- Do the answers incorporate improvement fundamentals (customer focus, process knowledge, small tests of change/PDSA)?
- Do the three elements (Aim, Measure, Change) bear some relationship to each other?
Use the guide below to assign scores:
- 0 = no response
- 1 = attempted, but way off the mark; highest possible score if only one element addressed
- 2 = needs substantial modification; elements unrelated; highest possible score if only two elements addressed
- 3 = good; needs modification; elements poorly related
- 4 = very good; needs minimal modification; elements related
- 5 = excellent; no modification needed; elements clearly related
Abbreviation: PDSA indicates Plan–Do–Study–Act.
Method
The revision of the QIKAT scoring rubric occurred in three phases from 2009 through 2012. A national, interprofessional group consisting of QI educators from Baylor University College of Medicine (at the time L.J.M.), Case Western Reserve University School of Medicine (M.K.S.) and School of Nursing (M.D.), Geisel School of Medicine at Dartmouth (G.O.), and the University of Missouri–Columbia School of Medicine (L.A.H., J.B.) and School of Nursing (K.R.C.) collaborated on this development and validity assessment project. We had either participated in the original QIKAT development effort or used the tool extensively in our courses at our respective institutions. The Dartmouth College Committee for the Protection of Human Subjects approved this work. The study progressed through the following three phases:
- Phase 1—development of a revised grading rubric with consistent language and elements
- Phase 2—internal pilot testing to assess for inter- and intrarater reliability and to evaluate the QIKAT-R’s ability to distinguish poor, fair, and excellent responses
- Phase 3—external testing of interrater reliability and the QIKAT-R’s ability to distinguish poor, fair, and excellent responses
Phase 1—Rubric development
The iterative process of creating specific items for each response element occurred during a series of teleconferences. The expert consensus regarding criteria for each element stemmed from the subsections of the original QIKAT—Aim, Measure, Change—which together represent the three major elements of the Model for Improvement, a core paradigm within health care improvement.27–29 In addition to group discussions and asynchronous editing, seven of us (M.K.S., G.O., K.R.C., M.D., J.B., L.J.M., and L.A.H.) applied the scoring rubric to historical QIKAT responses. Through this process, we tested the instructions and the usability of the revised rubric. After each cycle, we modified the rubric on the basis of the feedback. This process resulted in a structured nine-point scale for each scenario (three points for each subsection; see Box 2) that replaced the original five-point scale. The total possible score for a three-scenario QIKAT-R is thus 27 instead of the prior 15. Rather than asking scorers to assess the global “goodness” of a response, we identified specific components of the Aim, Measure, and Change that constitute an appropriate response and that scorers can evaluate with a dichotomous “yes” or “no” judgment (Box 2).
Box 2 The Scoring Rubric for the Quality Improvement Knowledge Application Tool Revised (QIKAT-R)a
Phase 2—Pilot testing
The goal for Phase 2 was to assess validity and inter- and intrarater reliability within the research group. Two of us (K.R.C. and M.K.S.), experts in QI education with experience in administering and scoring the original QIKAT, examined a pool of responses from students at the University of Missouri–Columbia. We identified responses from 4 different scenarios involving different specialties (orthopedics, anesthesia, radiology, and nephrology) that fit one of three levels of quality—“excellent,” “fair,” and “poor”—resulting in 12 total responses.
We entered these 12 responses into SurveyMonkey as model QI responses and sent them to five individuals (G.O., J.B., L.A.H., L.J.M., and M.D.) to score using the QIKAT-R rubric. We randomized the order of the four scenarios for each scorer, and the scorers were blinded to the quality level of the responses. After six months, the five scorers rescored the same 12 responses, again shuffled, to assess intrarater agreement.
We used concordance correlation coefficients (CCCs) to assess intra- and interrater agreement for each item, for the three subscales, and for the total score. The interrater CCC “is a measure of agreement based on the average of multiple readings from each rater,”30 whereas the intrarater CCC is an estimate of the proportion of total variance attributable to differences in student responses to the scenarios. The CCC ranges from zero to one; a value closer to 0 represents less agreement, and a value closer to 1 represents more agreement.31 We assessed the CCCs for the overall nine-point score and for each three-point subsection score.
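The two-rater form of Lin’s concordance correlation coefficient is simple enough to sketch directly; the generalized multi-rater version used in this study (computed via a SAS macro) extends the same idea. The sketch below is a minimal Python illustration with hypothetical rubric totals, not study data, and the function name `lins_ccc` is our own.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for two raters.

    CCC = 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2),
    where s_xy is the covariance of the two raters' scores.
    Values near 1 indicate strong agreement; values near 0, little agreement.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    sxy = ((x - mx) * (y - my)).mean()  # population covariance
    return 2 * sxy / (x.var() + y.var() + (mx - my) ** 2)

# Hypothetical nine-point QIKAT-R totals from two raters for six responses:
rater_a = [7, 5, 4, 8, 3, 6]
rater_b = [8, 5, 4, 7, 3, 6]
ccc = lins_ccc(rater_a, rater_b)  # close to, but below, 1.0
```

Unlike the Pearson correlation, the CCC penalizes systematic differences between raters (the `(mx - my) ** 2` term), so a rater who consistently scores one point higher than a colleague attains a CCC below 1 even if the two rank responses identically.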
Phase 3—External testing
On the basis of the findings from Phase 2, we expanded testing of the new QIKAT-R scoring rubric to 18 faculty from the United States, Canada, and Australia who had used the original QIKAT in their QI teaching or curricula but had not received training on the QIKAT-R. After recruiting these faculty via e-mail, we sent them the new scoring rubric and the same 12 scenario responses used in Phase 2 through SurveyMonkey. Again, we randomly shuffled the scenario responses for each scorer. We provided instructions in the survey but did not provide specific training for using the new rubric. We collected not only the QIKAT-R scores but also qualitative feedback about the new rubric and how it might be improved (not reported).
We completed our analysis using SAS statistical software (version 9.2; SAS Institute, Cary, North Carolina). We calculated our intraclass correlations using a macro written by Robert M. Hamer.32 All responses were rated by the same raters, who we assumed represented a random selection of all possible raters.
Results

Phase 1—The revised rubric
The new rubric contains nine items (Box 2): three items under each of the three subsections Aim, Measure, and Change. This nine-point scale provides standard descriptions within each subsection, and the descriptions for each item integrate accepted QI content and language. For example, the three Aim items focus on the specificity of the aim or goal, including whether it addresses a system-level problem, specifies a direction of change, and includes a time frame—parameters that are consistent with a specific, measurable, achievable, realistic, and time-bound, or “SMART,” goal.33 Similarly, the Measure items assess whether the measure is related to and consistent with the aim, is available for analysis over time (e.g., with a statistical process control chart), and captures a key process or outcome. The Change items assess whether the change is related to the Aim statement, whether it can be implemented with existing resources (is feasible), and whether it is described in sufficient detail to initiate a test of change consistent with improvement methodology.27–29
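The dichotomous structure described above can be made concrete with a short sketch: each of the nine items receives a yes (1) or no (0) judgment, items sum to a 0–3 subsection score, and subsections sum to a 0–9 scenario score. The item labels below are paraphrased shorthand for illustration, not the official Box 2 wording.

```python
# Paraphrased item labels for illustration; see Box 2 for the official wording.
SUBSECTIONS = {
    "Aim": ["system_level", "direction_of_change", "specific_characteristic"],
    "Measure": ["consistent_with_aim", "analyzable_over_time", "key_process_or_outcome"],
    "Change": ["related_to_aim", "feasible_with_existing_resources", "detailed_enough_to_test"],
}

def score_scenario(judgments):
    """Sum nine dichotomous yes/no (1/0) judgments into subsection
    scores (0-3 each) and a scenario total (0-9)."""
    subsection_scores = {
        name: sum(judgments[item] for item in items)
        for name, items in SUBSECTIONS.items()
    }
    return subsection_scores, sum(subsection_scores.values())

# Example: a response with a strong Aim but a weak Change statement.
judgments = {
    "system_level": 1, "direction_of_change": 1, "specific_characteristic": 1,
    "consistent_with_aim": 1, "analyzable_over_time": 0, "key_process_or_outcome": 1,
    "related_to_aim": 1, "feasible_with_existing_resources": 0, "detailed_enough_to_test": 0,
}
by_subsection, total = score_scenario(judgments)  # Aim 3, Measure 2, Change 1; total 6 of 9
```

A full three-scenario QIKAT-R administration simply repeats this for each scenario, for a maximum total of 27.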
Phase 2—Pilot testing
In Phase 2, the average scores of the five graders for each of the 12 responses demonstrated that the QIKAT-R was able to discriminate among excellent, fair, and poor responses. For the anesthesia scenario, the mean score for an excellent student response was 7.8 (out of 9), whereas a fair response scored 5.6 and a poor response 4.0. For the orthopedic scenario, the average scores for excellent, fair, and poor responses were, respectively, 7.6, 5.8, and 4.2. Similarly, for the radiology scenario, the average scores for excellent, fair, and poor responses were, respectively, 8.4, 7.2, and 2.6, and for the nephrology scenario the means were 8.6, 4.0, and 3.6. There was a clear distinction between excellent and poor responses, but less discrimination between fair and poor responses. Furthermore, we found that the excellent responses happened to be written by students who had taken a QI course, whereas the poor responses happened to be written by students who had not taken the course.
The overall intrarater agreement coefficient (0 = low to 1 = high) for the nine-point rubric between the test and retest after six months was 0.766. The interrater agreement among the five scorers was also high at 0.794 (Table 1). Further assessment of each subsection (Aim, Measure, or Change), as well as each element within each subsection, also demonstrated agreement, but the agreement was not as strong. The overall interrater agreement for the Aim subsection was strong (0.884), but lower for both Measure (0.572) and Change (0.606). Intrarater agreement showed a similar pattern for each subsection. Each specific item under each subsection also showed variability in intra- and interrater agreement. For example, in the Aim subsection, the interrater agreement scores ranged from 0.221 for A1 (“The Aim is focused on the system level of the problem presented”) to 0.979 for A3 (“The Aim includes at least one specific characteristic such as magnitude or time frame”). We noted this pattern for interrater agreement scores for the Measure and Change subsections as well; the range for the Measure items was 0.261 to 0.431, and for the Change items it was 0.530 to 0.769.
Phase 3—External testing
In Phase 3, 18 of the 20 invited scorers completed the entire survey; one completed only a few parts, and one did not return a response. We analyzed the data from the 18 complete surveys. The 18 scorers represented an international group from several professions and specialties, including internal medicine, pediatrics, nursing, and pharmacy.
The average QIKAT-R scores for the excellent quality responses ranged from 6.7 (out of a possible score of 9) for the orthopedic scenario to 8.5 for the radiology scenario (Table 2). The average poor response scores ranged from 1.9 (radiology) to 4.8 (nephrology). For each scenario, the average scores were statistically different between the poor and excellent responses (P < .0001, Table 2). Additionally, for the anesthesia and radiology scenarios, the average scores were statistically different among all three levels of responses.
We used intraclass correlation coefficients (ICCs) to assess the interrater agreement across all 18 scorers. As in Phase 2, we also assessed the subsections (Aim, Measure, Change) across the four scenarios (Table 1). The ICC was 0.66 for the overall rubric; the ICC was stronger for the Aim subsection (0.711) than for either Measure (0.488) or Change (0.446). For the individual items, the ICC was weakest for item A1 (“Aim is focused on the system level of the problem presented”) at 0.085 and highest for A3 (“Aim includes at least one specific characteristic such as magnitude or time frame”) at 0.838. Again, we noted this pattern for the Measure and Change subsections as well; the range for the Measure items was 0.156 to 0.369, and for the Change items it was 0.245 to 0.309 (Table 1).
Discussion and Conclusions
Through development, pilot testing, and external testing, we have created and validated an easy-to-use scoring rubric: the QIKAT-R. This revised rubric, which will be available to the public through MedEdPORTAL's DREAM collection, features a nine-point scale, uses dichotomous responses, and demonstrates good psychometric qualities. The QIKAT-R successfully discriminates excellent from fair and poor responses, and it performs well even with different scorers from various geographic locations and health professions. It also performs similarly across scenarios representing four different medical specialties, and thus its use is not limited to any given specialty.
The nine-point scale, developed through an iterative process, includes key components of the three subsections (Aim, Measure, Change) of the Model for Improvement.27–29 The items on the QIKAT-R represent the consensus of an expert panel of QI educators who chose the items that they considered most important for foundational QI knowledge (Box 2). Limiting each subsection to three items makes the instrument focused and easy to use. The scenarios used for the QIKAT-R are the scenarios we used initially in our work in 2004.18 They are intended to be generic enough that a QI educator could write and use different scenarios pertinent to a particular specialty (e.g., surgery, psychiatry, radiology) or profession (e.g., medicine, nursing, pharmacy). In fact, many QI educators have written and used specialty-specific23–26 and medical student scenarios19 with the original QIKAT scoring system, and we anticipate that the QIKAT-R will be equally adaptable. Our experience also suggests that the instrument does not require prior training for scorers who have experience in QI education.
In both Phases 2 and 3 of the study, the instrument discriminated between excellent and poor/fair responses. The QIKAT-R’s ability to distinguish between such groups is promising and suggests that this new rubric has construct validity. Furthermore, in Phase 2 the overall nine-point interrater (among five scorers) and intrarater (six-month retest) agreement for the QIKAT-R was, respectively, 0.794 and 0.766, indicating strong agreement. When expanded to 18 external scorers (Phase 3) who received no training with the instrument, the interrater agreement remained good at 0.66 for the entire nine-point scale (we did not test intrarater agreement for the 18 external scorers). This demonstrates that the revised rubric has good reliability even without formal training. The entire nine-point scale is the most important level of assessment because it indicates whether a learner can apply all QI components in a cohesive set of answers. Additionally, our incidental finding that all of the excellent responses were authored by students who had taken a QI course and all of the poor responses were written by students who had not further supports the construct validity of the instrument.
The inter- and intrarater reliability scores for the three individual subsections were not as strong. The Aim subsection had the strongest interrater reliability, whereas Measure and Change were lower in both Phases 2 and 3. This finding is consistent with the more contextual nature of Measure and Change statements, compared with Aim statements, in QI work. Responses to the individual items in Measure and Change may vary according to local contexts, which likely led to weaker agreement in the scoring. Predictably, as the subjectivity of an element increased, the agreement for that individual item decreased. For example, scoring the item “measure readily available so data can be analyzed over time” depends on local data availability, and scoring the item “change proposes to use existing resources” depends on an understanding of local institutional resources. The scenarios do not include this level of detail, so the responder and the scorer fill in this information on the basis of their experience. This subjectivity lends itself to less agreement, especially when scorers are from different professions, institutions, and countries. Educators from a single institution will likely use the QIKAT-R to assess their own program and learners, and we suspect that agreement on individual items would be higher in such a situation.
Without strong agreement in the subsections and individual items, it would be challenging to use the subsection scores to evaluate specific parts of a curriculum (e.g., How well are we teaching learners to identify measures?) or to remediate individual learners on specific items (e.g., You had difficulty writing a clear change; let’s focus on improving your skills). Rather, the QIKAT-R is valuable as a strong tool to assess the global application of core QI skills. The three subsections are not intended to stand alone since Aim, Measure, and Change in QI are always interrelated. The relationship between the overall score and scores on the subsections could be assessed in further single-institution studies or studies using limited settings and populations.
One limitation of this study is that five individuals who participated in the consensus review also defined the key concepts represented by the three subsections of the instrument. This redundancy may have led us to miss important points; however, our expert panelists came from multiple institutions and professions, and the iterative process we used for defining the items mandated a thorough and deliberate review of each individual item. Another possible limitation is that the QIKAT-R may not perform as well with scorers who have limited QI education experience. One way to address this is to compare results among faculty members with varying levels of experience and to assign final scores through discussion. Finally, we developed the tool for teaching core QI elements to novices; other QI curricula may include different content, which could make this tool less appropriate.
In Phases 2 and 3 of the study, we noted that the instrument discriminated among excellent, fair, and poor responses for the anesthesia and radiology scenarios but only between excellent and fair/poor responses for the orthopedics and nephrology scenarios. In fact, for nephrology, the mean score for the poor level was actually higher than the score for the fair level. This discrepancy is minimal (not statistically significant) and likely arises from multiple sources, including the small number of scorers, the details and difficulty of the four scenarios, the quality level assignment of responses, and variability in the QIKAT-R items. Given the instrument’s ability to differentiate between responses at the extremes (excellent and poor), the tool is functional for practical application. Further studies refining the scale to better differentiate between fair and poor responses are needed, especially if educators hope to use the tool to identify struggling learners.
A particular strength of our study, supporting the validity of our results, is the blinding of our scorers to the quality levels of the responses in Phases 2 and 3. Furthermore, the results from Phase 3 scorers representing several countries and professions suggest that the instrument has generalizable applicability. We suspect that interrater reliability will only improve when scorer groups come from a single institution, as will most often be the case. The QIKAT-R is a user-friendly instrument with strong validity and good reliability. Its nine-point scale with binary responses allows for an objective assessment of QI knowledge application across a variety of contexts. The instrument represents a significant step forward in assessing the application of QI knowledge in health professions education and may become an important tool for assessing foundational QI knowledge.
References
7. Boonyasai RT, Windish DM, Chakraborti C, Feldman LS, Rubin HR, Bass EB. Effectiveness of teaching quality improvement to clinicians: A systematic review. JAMA. 2007;298:1023–1037
8. Wong BM, Levinson W, Shojania KG. Quality improvement in medical education: Current state and future directions. Med Educ. 2012;46:107–119
9. Windish DM, Reed DA, Boonyasai RT, Chakraborti C, Bass EB. Methodological rigor of quality improvement curricula for physician trainees: A systematic review and recommendations for change. Acad Med. 2009;84:1677–1692
10. Leenstra JL, Beckman TJ, Reed DA, et al. Validation of a method for assessing resident physicians’ quality improvement proposals. J Gen Intern Med. 2007;22:1330–1334
11. Wittich CM, Reed DA, Drefahl MM, et al. Relationship between critical reflection and quality improvement proposal scores in resident doctors. Med Educ. 2011;45:149–154
12. Tomolo AM, Lawrence RH. Development and preliminary evaluation of a practice-based learning and improvement tool for assessing resident competence and guiding curriculum development. J Grad Med Educ. 2011;3:41–48
13. Tomolo AM, Lawrence RH, Watts B, Augustine S, Aron DC, Singh MK. Pilot study evaluating a practice-based learning and improvement curriculum focusing on the development of system-level quality improvement skills. J Grad Med Educ. 2011;3:49–58
14. O’Connor ES, Mahvi DM, Foley EF, Lund D, McDonald R. Developing a practice-based learning and improvement curriculum for an academic general surgery residency. J Am Coll Surg. 2010;210:411–417
16. Chan KS, Hsu YJ, Lubomski LH, Marsteller JA. Validity and usefulness of members’ reports of implementation progress in a quality improvement initiative: Findings from the Team Check-up Tool. Implement Sci. 2011;6:115. http://www.implementationscience.com/content/6/1/115. Accessed July 2, 2014
17. Morrison LJ, Headrick LA, Ogrinc G, Foster T. The Quality Improvement Knowledge Application Tool: An instrument to assess knowledge application in practice-based learning and improvement [abstract]. J Gen Intern Med. 2003;18:S250
18. Ogrinc G, Headrick LA, Morrison LJ, Foster T. Teaching and assessing resident competence in practice-based learning and improvement. J Gen Intern Med. 2004;19(5 pt 2):496–500
19. Ogrinc G, West A, Eliassen MS, Liuw S, Schiffman J, Cochran N. Integrating practice-based learning and improvement into medical student learning: Evaluating complex curricular innovations. Teach Learn Med. 2007;19:221–229
20. Hall LW, Headrick LA, Cox KR, Deane K, Gay JW, Brandt J. Linking health professional learners and health care workers on action-based improvement teams. Qual Manag Health Care. 2009;18:194–201
21. Ladden MD, Bednash G, Stevens DP, Moore GT. Educating interprofessional learners for quality, safety and systems improvement. J Interprof Care. 2006;20:497–505
22. Vinci LM, Oyler J, Johnson JK, Arora VM. Effect of a quality improvement curriculum on resident knowledge and skills in improvement. Qual Saf Health Care. 2010;19:351–354
23. Reardon CL, Ogrinc G, Walaszek A. A didactic and experiential quality improvement curriculum for psychiatry residents. J Grad Med Educ. 2011;3:562–565
24. Arbuckle MR, Weinberg M, Cabaniss DL, et al. Training psychiatry residents in quality improvement: An integrated, year-long curriculum. Acad Psychiatry. 2013;37:42–45
25. Tudiver F, Click IA, Ward P, Basden JA. Evaluation of a quality improvement curriculum for family medicine residents. Fam Med. 2013;45:19–25
26. Varkey P, Karlapudi SP. Lessons learned from a 5-year experience with a 4-week experiential quality improvement curriculum in a preventive medicine fellowship. J Grad Med Educ. 2009;1:93–99
27. Morrison LJ, Headrick LA. Teaching residents about practice-based learning and improvement. Jt Comm J Qual Patient Saf. 2008;34:453–459
28. Langley GJ, Nolan KM, Nolan TW. The foundation of improvement. Quality Progress. 1994;27:81–86
29. Langley GJ, Moen R, Nolan TW, Norman CL, Provost LP. The Improvement Guide: A Practical Approach to Enhancing Organizational Performance. San Francisco, Calif: Jossey-Bass; 1996
30. Lin L, Hedayat AS, Wu W. A unified approach for assessing agreement for continuous and categorical data. J Biopharm Stat. 2007;17:629–652
31. Lin L, Hedayat AS, Wu W. Statistical Tools for Measuring Agreement. New York, NY: Springer; 2012
33. Doran GT. There’s a S.M.A.R.T. way to write management’s goals and objectives. Manage Rev. 1981;70:35–36