The Quality Improvement Knowledge Application Tool Revised (QIKAT-R)

Singh, Mamta K. MD, MS; Ogrinc, Greg MD, MS; Cox, Karen R. RN, PhD; Dolansky, Mary RN, PhD; Brandt, Julie PhD; Morrison, Laura J. MD; Harwood, Beth MEd; Petroski, Greg PhD; West, Al PhD; Headrick, Linda A. MD, MS

doi: 10.1097/ACM.0000000000000456
Research Reports

Purpose Quality improvement (QI) has been part of medical education for over a decade. Assessment of QI learning remains challenging. The Quality Improvement Knowledge Application Tool (QIKAT), developed a decade ago, is widely used despite its subjective nature and inconsistent reliability. From 2009 to 2012, the authors developed a revised QIKAT, the “QIKAT-R,” and assessed its validity.

Method Phase 1: Using an iterative, consensus-building process, a national group of QI educators developed a scoring rubric with defined language and elements. Phase 2: Five scorers pilot tested the QIKAT-R to assess validity and inter- and intrarater reliability using responses to four scenarios, each with three different levels of response quality: “excellent,” “fair,” and “poor.” Phase 3: Eighteen scorers from three countries used the QIKAT-R to assess the same sets of student responses.

Results Phase 1: The QI educators developed a nine-point scale that uses dichotomous answers (yes/no) for each of three QIKAT-R subsections: Aim, Measure, and Change. Phase 2: The QIKAT-R showed strong discrimination between “poor” and “excellent” responses, and the intra- and interrater reliability were strong. Phase 3: The discriminative validity of the instrument remained strong between excellent and poor responses. The intraclass correlation was 0.66 for the total nine-point scale.

Conclusions The QIKAT-R is a user-friendly instrument that maintains the content and construct validity of the original QIKAT but provides greatly improved interrater reliability. The clarity within the key subsections aligns the assessment closely with QI knowledge application for students and residents.

Dr. Singh is associate professor of medicine, Division of General Medicine, Louis Stokes Veterans Affairs Medical Center, Case Western Reserve University, Cleveland, Ohio.

Dr. Ogrinc is associate professor of community and family medicine and of medicine, VA Medical Center, White River Junction, Vermont, and Geisel School of Medicine, Hanover, New Hampshire.

Dr. Cox is manager, Quality Improvement, Office of Clinical Effectiveness, University of Missouri Health Care, Columbia, Missouri.

Dr. Dolansky is associate professor, Frances Payne Bolton School of Nursing, Case Western Reserve University, Cleveland, Ohio.

Dr. Brandt is associate director of quality improvement, School of Medicine, University of Missouri, Columbia, Missouri.

Dr. Morrison is currently director of palliative medicine education, Department of Medicine, Yale University School of Medicine, New Haven, Connecticut, but was at Baylor College of Medicine in the Division of Geriatrics at the time of this study.

Ms. Harwood is research associate, Geisel School of Medicine, Hanover, New Hampshire.

Dr. Petroski is assistant professor of biostatistics, School of Medicine, University of Missouri, Columbia, Missouri.

Dr. West is biostatistician, Department of Veterans Affairs, VA Medical Center, White River Junction, Vermont.

Dr. Headrick is senior associate dean for education and professor of medicine, School of Medicine, University of Missouri, Columbia, Missouri.

Funding/Support: This material is based on work supported by the Department of Veterans Affairs Office of Health Services Research and Development grant EDU08-426 and the use of facilities and material at the White River Junction Veterans Affairs Hospital in White River Junction, Vermont, the University of Missouri School of Medicine in Columbia, Missouri, and the Louis Stokes Cleveland Veterans Affairs Hospital in Cleveland, Ohio.

Other disclosures: None reported.

Ethical approval: The Dartmouth College Committee for the Protection of Human Subjects approved this work.

Disclaimers: This report represents the work of the authors alone and does not necessarily represent the views of the U.S. Department of Veterans Affairs.

Correspondence should be addressed to Dr. Singh, Louis Stokes VA Medical Center, Case Western Reserve University, EUL, 2M680, 10701 East Blvd., Cleveland, OH 44016; telephone: (216) 791-2300 ext. 2326; e-mail:

Since 2003, the Accreditation Council for Graduate Medical Education1,2 has required practice-based learning and improvement (PBLI) and systems-based practice as two of six core competencies for resident physicians. The American Board of Medical Specialties also requires these two competencies for board certification in all specialties.3 Additionally, the Robert Wood Johnson Foundation4 and the QSEN (Quality and Safety Education in Nursing) Institute5 have each published recommendations and curricula to teach quality improvement (QI) to health professional learners, and in 2012, the Association of American Medical Colleges recommended a set of competencies for faculty educators in QI and patient safety.6 These competencies, recommendations, and guidelines highlight the increasing importance of QI education. They have also generated substantial interest in designing, implementing, and evaluating curricula for QI. Review articles about teaching QI have described medical training programs and curricula, identified common elements across these programs and curricula, and recommended important next steps.7–9 Each review’s authors have underscored current challenges with evaluating learner QI competence and argued for better instruments to assess learner achievement in QI.

Over the last decade, various QI assessment tools have surfaced, each measuring specific components of QI education. For instance, the Quality Improvement Project Assessment Tool assesses the structure, content, and strength of an initial QI proposal.10,11 The Systems Quality Improvement and Assessment Tool evaluates PBLI self-efficacy, knowledge, and application skills in resident learners12 and can help guide PBLI residency curricula.13 Surveys measuring resident self-reported attitudes about PBLI and QI project implementation have proven to be a useful way for educators to measure achievement of curricular objectives.14 The Systems Thinking Scale measures systems thinking in the context of QI, whereas the Team Check-up Tool measures the QI intervention context itself.15,16 The Quality Improvement Knowledge Application Tool (QIKAT), originally described in 2003 and 2004, has been used to assess the results of an internal medicine elective rotation for residents in QI.17,18

The QIKAT consists of three short descriptions of scenarios. Each depicts a system-level quality problem. The respondent is required to read the scenario and supply a free-text response consisting of an aim, a measure, and one focused change for a QI effort that addresses the system-level issue raised in the scenario. The QIKAT thus assesses an individual’s ability to decipher a quality problem within a complex system and propose an initiative for improvement. This capacity of QIKAT, coupled with its straightforward administration and its ability to measure QI knowledge application close to curricular interventions, resulted in the widespread use of the QIKAT across disciplines and developmental learning stages. It has been used to assess QI learning in medical school curricula19,20; in interprofessional education21; in internal medicine,22 psychiatry,23,24 and family medicine residencies25; and in a preventive medicine fellowship.26 When used by expert scorers who resolve discrepancies and agree on a final score, the QIKAT has demonstrated very good content, construct, and predictive validity.20

Although many have found the QIKAT useful, its scoring system has limited its widespread adoption. The original scoring system assigned a score from 1 to 5 to each scenario; each score was based on a value assessment from the scorer (Box 1). This scoring system is subjective and has inconsistent reliability.17 Some QI educators have proposed amended scoring systems, but these have only moderately improved the reliability across scorers.22 To make the QIKAT more widely usable, we hoped to create a revised, easy-to-use scoring rubric to assess application of QI learning. We report our efforts to develop and assess the validity of the “QIKAT-R” here.

Box 1 The Original Quality Improvement Knowledge Application Tool (QIKAT) Scoring System

When scoring, please consider the following factors:

  • Do the answers incorporate improvement fundamentals (customer focus, process knowledge, small tests of change/PDSA)?
  • Do the three elements (Aim, Measure, Change) bear some relationship to each other?

Use the guide below to assign scores:

  • 0 = no response
  • 1 = attempted, but way off the mark; highest possible score if only one element addressed
  • 2 = needs substantial modification; elements unrelated; highest possible score if only two elements addressed
  • 3 = good; needs modification; elements poorly related
  • 4 = very good; needs minimal modification; elements related
  • 5 = excellent; no modification needed; elements clearly related

Abbreviation: PDSA indicates Plan–Do–Study–Act.



Method

The revision of the QIKAT scoring rubric occurred in three phases from 2009 through 2012. A national, interprofessional group, consisting of QI educators from Baylor University College of Medicine (at the time L.J.M.), Case Western Reserve University School of Medicine (M.K.S.) and School of Nursing (M.D.), Geisel School of Medicine at Dartmouth (G.O.), and the University of Missouri–Columbia School of Medicine (L.A.H., J.B.) and School of Nursing (K.R.C.) collaborated on this development and validity assessment project. We had either participated in the original QIKAT development effort or used the tool extensively in our courses at our respective institutions. The Dartmouth College Committee for the Protection of Human Subjects approved this work. The study progressed through the following three phases:

  • Phase 1—development of a revised grading rubric with consistent language and elements
  • Phase 2—internal pilot testing to assess for inter- and intrarater reliability and to evaluate the QIKAT-R’s ability to distinguish poor, fair, and excellent responses
  • Phase 3—external testing of interrater reliability and the QIKAT-R’s ability to distinguish poor, fair, and excellent responses

Phase 1—Development

The iterative process of creating specific items for each response element occurred during a series of teleconferences. The expert consensus regarding criteria for each element stemmed from the subsections of the original QIKAT—Aim, Measure, Change—which together represent the three major elements of the Model for Improvement, a core paradigm within health care improvement.27–29 In addition to group discussions and asynchronous editing, seven of us (M.K.S., G.O., K.R.C., M.D., J.B., L.J.M., and L.A.H.) applied the scoring rubric to historical QIKAT responses. Through this process, we tested the instructions and the usability of the revised rubric. After each cycle, we modified the rubric on the basis of the feedback. This process resulted in a structured nine-point scale for each scenario (three points for each subsection; see Box 2) that replaced the original five-point scale. The total possible score for a three-scenario QIKAT-R would be 27 instead of the prior 15. We identified specific components of Aim, Measure, and Change that, rather than assess the global goodness of the responses, constituted appropriate responses that scorers could evaluate with a dichotomous “yes” or “no” answer (Box 2).

Box 2 The Scoring Rubric for the Quality Improvement Knowledge Application Tool Revised (QIKAT-R)



Phase 2—Pilot testing

The goal for Phase 2 was to assess validity and inter- and intrarater reliability within the research group. Two of us (K.R.C. and M.K.S.), experts in QI education with experience in administering and scoring the original QIKAT, examined a pool of responses from students at the University of Missouri–Columbia. For each of 4 scenarios involving different specialties (orthopedics, anesthesia, radiology, and nephrology), we identified one response at each of three levels of quality (“excellent,” “fair,” and “poor”), resulting in 12 total responses.

We entered these 12 responses into SurveyMonkey as model QI responses and sent them to five individuals (G.O., J.B., L.A.H., L.J.M., and M.D.) to score using the QIKAT-R rubric. We randomized the order in which the four scenarios were presented to each scorer, and the scorers were blinded to the level of quality of the responses. After six months, the five scorers rescored the same 12 responses, again shuffled, to assess intrarater agreement.

We used concordance correlation coefficients (CCC) to assess intra- and interrater agreement for each item, for the three subscales, and for the total score. The interrater CCC “is a measure of agreement based on the average of multiple readings from each rater,”30 whereas the intrarater CCC is an estimate of the proportion of total variance attributable to differences in student responses to the scenarios. The CCC, which we computed using SAS macro code, ranges from zero to one; a score closer to 0 represents less agreement and a score closer to 1 represents more agreement.31 We assessed the CCCs for the overall nine-point score and for each three-point subsection score.
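To make the agreement statistic concrete, here is a minimal Python sketch of the simplest pairwise form of Lin's concordance correlation coefficient, applied to two passes by a single rater. It is an illustration only: the study used a unified multi-rater approach implemented in SAS macro code, which this sketch does not reproduce, and the function name and scores below are hypothetical. Note that this pairwise form can fall below 0 when two sets of scores disagree systematically.

```python
def concordance_correlation(x, y):
    """Pairwise Lin's concordance correlation coefficient for two paired score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Divide-by-n variances and covariance, a common convention for this estimator.
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference penalizes a rater who is consistently higher or lower.
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        return 1.0  # every score is identical in both lists
    return 2 * cov_xy / denominator


# Made-up nine-point totals for 12 responses scored twice by one hypothetical rater.
first_pass = [8, 6, 4, 8, 6, 4, 9, 7, 3, 9, 4, 3]
second_pass = [8, 5, 4, 7, 6, 5, 9, 7, 2, 8, 4, 4]
print(round(concordance_correlation(first_pass, second_pass), 3))
```

Unlike a plain correlation, the coefficient drops when one pass is shifted relative to the other, which is why it suits test-retest scoring of the same responses.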

Phase 3—External testing

On the basis of the findings from Phase 2, we expanded testing of the new QIKAT-R scoring rubric to 18 faculty from the United States, Canada, and Australia who had used the original QIKAT in their QI teaching or curricula but had not received training for the new QIKAT-R. After recruiting these faculty via e-mail, we sent them the new scoring rubric and the same 12 scenario responses used in Phase 2 through SurveyMonkey. Again, we randomly shuffled the scenario responses for each scorer. We provided instructions in the survey but did not provide specific training for using the new rubric. We collected not only the QIKAT-R scores but also qualitative feedback about the new rubric and how it might be improved (not reported).

We completed our analysis using SAS statistical software (version 9.2; Cary, North Carolina). We calculated our intraclass correlations using a macro written by Robert M. Hamer.32 All responses were rated by the same raters, who we assumed represented a random selection of all possible raters.
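The design described here (every response scored by every rater, with raters treated as a random sample) matches a two-way random-effects intraclass correlation. The text does not say which of the six forms computed by the macro was reported, so the single-rater form, often written ICC(2,1), is an assumption in the Python sketch below. The function name and the ratings matrix are hypothetical and are not study data.

```python
def icc_2_1(ratings):
    """Two-way random-effects, single-rater ICC from a responses-by-raters table.

    `ratings` is a list of rows, one row per response, one score per rater.
    """
    n = len(ratings)        # number of responses (targets)
    k = len(ratings[0])     # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 rows and 2 columns")
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Sums of squares for responses, raters, and the residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((v - grand_mean) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative table: 6 responses (rows) scored by 4 hypothetical raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(round(icc_2_1(example), 2))
```

Because this form treats rater differences as error, raters who are systematically harsher or more lenient lower the coefficient even when they rank the responses the same way.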


Results

Phase 1—Development

The new rubric contains nine items (Box 2): three items under each of the three subsections Aim, Measure, and Change. This nine-point scale provides standard descriptions within each subsection, and the descriptions for each item have integrated accepted QI content and language. For example, the three Aim items focus on the specificity of the aim or goal, including whether it addresses a system-level problem, specifies a direction of change, and includes a time frame; these parameters are consistent with a specific, measurable, achievable, realistic, and time-bound, or “SMART,” goal.33 Similarly, the Measure items assess whether the measure is related to and consistent with the aim, is available for analysis over time (e.g., with a statistical process control chart), and captures a key process or outcome. The Change items assess whether the change is related to the Aim statement, whether it can be implemented with existing resources (is feasible), and whether it is described in sufficient detail to initiate a test of change consistent with improvement methodology.27–29
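The scoring arithmetic implied by the nine yes/no items can be sketched in a few lines of Python. The item keys below are hypothetical paraphrases of the items described in this paragraph, not the official Box 2 wording, and the function names are illustrative.

```python
# Hypothetical item keys paraphrasing the nine rubric items described in the text;
# they are NOT the official Box 2 wording.
RUBRIC = {
    "aim": ("system_level", "direction_of_change", "specific_characteristic"),
    "measure": ("related_to_aim", "available_over_time", "key_process_or_outcome"),
    "change": ("related_to_aim", "uses_existing_resources", "enough_detail_to_test"),
}


def score_scenario(answers):
    """Return (subsection scores, total out of 9) for one scenario response.

    `answers` maps subsection -> {item key -> True/False}; a missing item counts as "no".
    """
    subscores = {
        section: sum(1 for item in items if answers.get(section, {}).get(item, False))
        for section, items in RUBRIC.items()
    }
    return subscores, sum(subscores.values())


def score_qikat_r(scenarios):
    """Sum the per-scenario totals; three scenarios give a maximum of 27."""
    return sum(score_scenario(answers)[1] for answers in scenarios)


# A made-up set of scorer judgments for one response.
example = {
    "aim": {"system_level": True, "direction_of_change": True,
            "specific_characteristic": False},
    "measure": {"related_to_aim": True, "available_over_time": False,
                "key_process_or_outcome": True},
    "change": {"related_to_aim": True, "uses_existing_resources": True,
               "enough_detail_to_test": True},
}
subscores, total = score_scenario(example)
print(subscores, total)
```

Keeping the subsection scores alongside the total mirrors how the rubric is reported: a three-point score for each of Aim, Measure, and Change, plus the nine-point total per scenario.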

Phase 2—Pilot testing

In Phase 2, the average scores of the five graders for each of the 12 responses demonstrated that the QIKAT-R was able to discriminate among excellent, fair, and poor responses. For the anesthesia scenario, the mean score for an excellent student response was 7.8 (out of 9), whereas a fair response received a mean score of 5.6 and a poor response a mean score of 4.0. For the orthopedic scenario, the average scores for excellent, fair, and poor responses were, respectively, 7.6, 5.8, and 4.2. Similarly, for the radiology scenario, the average scores for excellent, fair, and poor responses were, respectively, 8.4, 7.2, and 2.6, and for the nephrology scenario the means were 8.6, 4.0, and 3.6. There was clear distinction between excellent and poor responses, but less discrimination between “fair” and “poor” responses. Furthermore, we found that the excellent responses happened to be written by students who had taken a QI course, whereas the poor responses happened to be written by students who had not taken the course.
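The size of each separation can be read off the Phase 2 means quoted above. The short Python sketch below only restates that arithmetic; the dictionary layout and key names are illustrative choices, and no significance testing is implied.

```python
# Phase 2 mean scores (out of 9) as quoted in the text: (excellent, fair, poor).
phase2_means = {
    "anesthesia": (7.8, 5.6, 4.0),
    "orthopedics": (7.6, 5.8, 4.2),
    "radiology": (8.4, 7.2, 2.6),
    "nephrology": (8.6, 4.0, 3.6),
}

# Differences between quality levels for each scenario, rounded to one decimal.
gaps = {
    scenario: {
        "excellent_minus_poor": round(excellent - poor, 1),
        "fair_minus_poor": round(fair - poor, 1),
    }
    for scenario, (excellent, fair, poor) in phase2_means.items()
}

for scenario, gap in gaps.items():
    print(scenario, gap)
```

Every excellent-to-poor gap is at least 3.4 points, whereas the fair-to-poor gap shrinks to 0.4 for nephrology.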

The overall intrarater agreement coefficient (0 = low to 1 = high) for the nine-point rubric between the test and retest after six months was 0.766. The interrater agreement among the five scorers was also high at 0.794 (Table 1). Further assessment of each subsection (Aim, Measure, or Change), as well as each element within each subsection, also demonstrated agreement, but the agreement was not as strong. The overall interrater agreement for the Aim subsection was strong (0.884), but lower for both Measure (0.572) and Change (0.606). Intrarater agreement showed a similar pattern for each subsection. Each specific item under each subsection also showed variability in intra- and interrater agreement. For example, in the Aim subsection, the interrater agreement scores ranged from 0.221 for A1 (“The Aim is focused on the system level of the problem presented”) to 0.979 for A3 (“The Aim includes at least one specific characteristic such as magnitude or time frame”). We noted this pattern for interrater agreement scores for the Measure and Change subsections as well; the range for the Measure items was 0.261 to 0.431, and for the Change items it was 0.530 to 0.769.

Table 1


Phase 3—External testing

In Phase 3, 18 of the 20 invited scorers completed the entire survey; one completed a few parts, and one did not return a response. We analyzed the data on the 18 complete surveys. The 18 scorers represented an international group from several different professions, including internal medicine, pediatrics, nursing, and pharmacy.

The average QIKAT-R scores for the excellent quality responses ranged from 6.7 (out of a possible score of 9) for the orthopedic scenario to 8.5 for the radiology scenario (Table 2). The average poor response scores ranged from 1.9 (radiology) to 4.8 (nephrology). For each scenario, the average scores were statistically different between the poor and excellent responses (P < .0001, Table 2). Additionally, for the anesthesia and radiology scenarios, the average scores were statistically different among all three levels of responses.

Table 2

We used intraclass correlation coefficients (ICCs) to assess the interrater agreement across all 18 scorers. As in Phase 2, we also assessed the subsections (Aim, Measure, Change) across the four scenarios (Table 1). The ICC was 0.66 for the overall rubric; the ICC was stronger for the Aim subsection (0.711) than for either Measure (0.488) or Change (0.446). For the individual items, the ICC was weakest for item A1 (“Aim is focused on the system level of the problem presented”) at 0.085 and strongest for A3 (“Aim includes at least one specific characteristic such as magnitude or time frame”) at 0.838. Again, we noted this pattern for the Measure and Change subsections as well; the range was 0.156 to 0.369 for the Measure items and 0.245 to 0.309 for the Change items (Table 1).

Discussion and Conclusions

Through development, pilot testing, and external testing, we have created and validated an easy-to-use scoring rubric: the QIKAT-R. This revised rubric, which will be available to the public through MedEdPORTAL's DREAM collection, features a nine-point scale, uses dichotomous responses, and demonstrates good psychometric qualities. The QIKAT-R successfully discriminates excellent from fair and poor responses, and it performs well even with different scorers from various geographic locations and health professions. It also performs similarly across scenarios representing four different medical specialties, and thus its use is not limited to any given specialty.

The nine-point scale, developed through an iterative process, includes key components of the three subsections (Aim, Measure, Change) of the Model for Improvement.27–29 The items on the QIKAT-R represent the consensus of an expert panel of QI educators who chose items that they considered to be the most important for foundational QI knowledge (Box 2). Limiting each subsection to three items makes the instrument focused and easy to use. The scenarios used for the QIKAT-R are the scenarios we used initially in our work in 2004.18 They are intended to be generic enough that a QI educator could write and use different scenarios pertinent to a particular specialty (e.g., surgery, psychiatry, radiology) or profession (e.g., medicine, nursing, pharmacy). Our experience suggests that the instrument does not require prior training for scorers who have experience in QI education. In fact, many QI educators have written and used specialty-specific23–26 and medical student scenarios19 for the original QIKAT scoring system. We anticipate that the QIKAT-R will be equally adaptable.

In both Phases 2 and 3 of the study, the instrument discriminated between excellent and poor/fair responses. The QIKAT-R’s ability to distinguish between such groups is promising and suggests that this new rubric has construct validity. Furthermore, in Phase 2 the overall nine-point interrater (between five scorers) and intrarater (six-month retest) agreement for the QIKAT-R was, respectively, 0.794 and 0.766, showing strong correlation. When expanded to 18 external scorers (Phase 3) who received no training with the instrument, the interrater agreement remained strong at 0.66 for the entire nine-point scale (we did not test intrarater agreement for all 18 external scorers). This ICC demonstrates that the revised rubric has good reliability even without formal training. The entire nine-point scale is the most important level of assessment as it demonstrates whether a learner can apply all QI components in a cohesive set of answers. Additionally, our incidental finding that all of our excellent responses were authored by students who had taken a QI class and all of the poor responses were from students who had not taken such a class further highlights the construct validity of the instrument.

The inter- and intrarater reliability scores for the three individual subsections were not as strong. The Aim subsection had the strongest interrater reliability, whereas Measure and Change were lower in both Phases 2 and 3. This finding is consistent with the contextual nature of Measure and Change statements in QI protocols and results compared with Aim statements. For instance, responses to the individual items in Measure and Change may vary according to local contexts, which is likely what led to weaker agreement in the scoring. Predictably, as the subjectivity of each element increased, agreement on that individual item decreased. For example, scoring the item “measure readily available so data can be analyzed over time” depends on local data availability, and scoring the item “change proposes to use existing resources” depends on an understanding of local institutional resources. The scenarios do not include this level of detail, so the responder and the scorer fill in this information on the basis of their experience. This subjectivity lends itself to less agreement, especially when scorers are from different professions, institutions, and countries. Educators from a single institution will likely use the QIKAT-R to assess their program and learners, and we suspect that agreement on individual items would be higher in such a situation.

Without strong agreement in the subsections and individual items, it would be challenging to use the subsection scores to evaluate specific parts of a curriculum (e.g., How well are we teaching learners to identify measures?) or to remediate individual learners on specific items (e.g., You had difficulty writing a clear change; let’s focus on improving your skills). Rather, the QIKAT-R is valuable as a strong tool to assess the global application of core QI skills. The three subsections are not intended to stand alone since Aim, Measure, and Change in QI are always interrelated. The relationship between the overall score and scores on the subsections could be assessed in further single-institution studies or studies using limited settings and populations.

One limitation to this study is that five individuals who participated in the consensus review also defined the key concepts represented by the three subsections of the instrument. This redundancy may have led us to miss important points; however, our expert panelists were from multiple institutions and professions, and the iterative process we used for defining the items mandated a thorough and deliberate review of each individual item. Another possible limitation is that QIKAT-R may not perform as well with scorers who have limited QI education experience. One way to address this is to compare results among faculty members with varying levels of experience and to assign final scores through discussion. Finally, we developed the tool for the teaching of core QI elements to novices. Possibly other QI curricula would include different content, thus making this tool inappropriate.

In Phases 2 and 3 of the study, we noted that the instrument discriminated among excellent, fair, and poor responses for the anesthesia and radiology scenarios but only between excellent and fair/poor responses for the orthopedics and nephrology scenarios. In fact, for nephrology, the mean score for the poor level was actually higher than the score for the fair level. This discrepancy is minimal (not statistically significant) and likely arises from multiple sources including the small number of scorers, the details and difficulty of the four scenarios, the quality level assignment of responses, and variability in the QIKAT-R items. Given the instrument’s ability to differentiate between responses at the extremes (excellent and poor), the tool is functional for practical application. Further studies on refining the scale to better differentiate between fair and poor responses are needed, especially if educators hope to use the tool to identify struggling learners.

A particular strength of our study, supporting the validity of our results, is the blinding of our scorers to the levels of quality of responses in Phases 2 and 3. Furthermore, results for Phase 3 scorers representing several countries and professions suggest that the instrument has generalizable applicability. We suspect that the interrater reliability will only improve when QI grader groups are from, as is most likely, the same institution. The QIKAT-R is a user-friendly instrument with strong validity and good reliability. This nine-point scale with binary responses allows for an objective assessment of QI knowledge application across a variety of contexts. This instrument represents a significant step forward in assessing the application of QI knowledge in health professions education and may become an important tool for assessing the application of foundational QI knowledge.


References

1. Accreditation Council for Graduate Medical Education. ACGME Common Program Requirements. 2013. Accessed July 2, 2014
2. Accreditation Council for Graduate Medical Education and American Board of Medical Specialties. Toolbox of Assessment Methods. 2000. Accessed July 2, 2014
3. American Board of Medical Specialties. Maintenance of certification: Competencies and criteria. Accessed June 25, 2014
4. Robert Wood Johnson Foundation. Improving the science of continuous quality improvement program and evaluation: A RWJF national program. 2012. Accessed June 25, 2014
5. QSEN Institute. Competencies. Accessed June 25, 2014
6. Headrick LA, Baron RB, Pingleton SK, et al. Teaching for Quality: Integrating Quality Improvement and Patient Safety Across the Continuum of Medical Education: Report of an Expert Panel. Washington, DC: Association of American Medical Colleges; 2013. Accessed June 25, 2014
7. Boonyasai RT, Windish DM, Chakraborti C, Feldman LS, Rubin HR, Bass EB. Effectiveness of teaching quality improvement to clinicians: A systematic review. JAMA. 2007;298:1023–1037
8. Wong BM, Levinson W, Shojania KG. Quality improvement in medical education: Current state and future directions. Med Educ. 2012;46:107–119
9. Windish DM, Reed DA, Boonyasai RT, Chakraborti C, Bass EB. Methodological rigor of quality improvement curricula for physician trainees: A systematic review and recommendations for change. Acad Med. 2009;84:1677–1692
10. Leenstra JL, Beckman TJ, Reed DA, et al. Validation of a method for assessing resident physicians’ quality improvement proposals. J Gen Intern Med. 2007;22:1330–1334
11. Wittich CM, Reed DA, Drefahl MM, et al. Relationship between critical reflection and quality improvement proposal scores in resident doctors. Med Educ. 2011;45:149–154
12. Tomolo AM, Lawrence RH. Development and preliminary evaluation of practice based learning improvement for assessing residence competence and guiding curriculum development. J Grad Med Educ. 2011;3:41–48
13. Tomolo AM, Lawrence RH, Watts B, Augustine S, Aron DC, Singh MK. Pilot study evaluating a practice-based learning and improvement curriculum focusing on the development of system-level quality improvement skills. J Grad Med Educ. 2011;3:49–58
14. O’Connor ES, Mahvi DM, Foley EF, Lund D, McDonald R. Developing a practice-based learning and improvement curriculum for an academic general surgery residency. J Am Coll Surg. 2010;210:411–417
15. Dolansky M, Moore SM, Singh MK. The Systems Thinking Scale: A Measure of Systems Thinking—A Key Component of the Advancement of the Science of CQI. Accessed June 25, 2014
16. Chan KS, Hsu YJ, Lubomski LH, Marsteller JA. Validity and usefulness of members’ reports of implementation progress in a quality improvement initiative: Findings from the Team Check-up Tool. Implement Sci. 2011;6:115. Accessed July 2, 2014
17. Morrison LJ, Headrick LA, Ogrinc G, Foster T. The Quality Improvement Knowledge Application Tool: An instrument to assess knowledge application in practice-base learning and improvement [abstract]. J Gen Intern Med. 2003;18:S250
18. Ogrinc G, Headrick LA, Morrison LJ, Foster T. Teaching and assessing resident competence in practice-based learning and improvement. J Gen Intern Med. 2004;19(5 pt 2):496–500
19. Ogrinc G, West A, Eliassen MS, Liuw S, Schiffman J, Cochran N. Integrating practice-based learning and improvement into medical student learning: Evaluating complex curricular innovations. Teach Learn Med. 2007;19:221–229
20. Hall LW, Headrick LA, Cox KR, Deane K, Gay JW, Brandt J. Linking health professional learners and health care workers on action-based improvement teams. Qual Manag Health Care. 2009;18:194–201
21. Ladden MD, Bednash G, Stevens DP, Moore GT. Educating interprofessional learners for quality, safety and systems improvement. J Interprof Care. 2006;20:497–505
22. Vinci LM, Oyler J, Johnson JK, Arora VM. Effect of a quality improvement curriculum on resident knowledge and skills in improvement. Qual Saf Health Care. 2010;19:351–354
23. Reardon CL, Ogrinc G, Walaszek A. A didactic and experiential quality improvement curriculum for psychiatry residents. J Grad Med Educ. 2011;3:562–565
24. Arbuckle MR, Weinberg M, Cabaniss DL, et al. Training psychiatry residents in quality improvement: An integrated, year-long curriculum. Acad Psychiatry. 2013;37:42–45
25. Tudiver F, Click IA, Ward P, Basden JA. Evaluation of a quality improvement curriculum for family medicine residents. Fam Med. 2013;45:19–25
26. Varkey P, Karlapudi SP. Lessons learned from a 5-year experience with a 4-week experiential quality improvement curriculum in a preventive medicine fellowship. J Grad Med Educ. 2009;1:93–99
27. Morrison LJ, Headrick LA. Teaching residents about practice-based learning and improvement. Jt Comm J Qual Patient Saf. 2008;34:453–459
28. Langley GJ, Nolan KM, Nolan TW. The foundation of improvement. Quality Progress. 1994;27:81–86
29. Langley GJ, Moen R, Nolan TW, Norman CL, Provost LP. The Improvement Guide: A Practical Approach to Enhancing Organizational Performance. San Francisco, Calif: Jossey-Bass; 1996
30. Lin L, Hedayat AS, Wu W. A unified approach for assessing agreement for continuous and categorical data. J Biopharm Stat. 2007;17:629–652
31. Lin L, Hedayat AS, Wu W. Statistical Tools for Measuring Agreement. New York, NY: Springer; 2012
32. Hamer RM. Compute six intraclass correlation measures. Accessed July 2, 2014
33. Doran GT. There’s a S.M.A.R.T. way to write management’s goals and objectives. Manage Rev. 1981;70:35–36
© 2014 by the Association of American Medical Colleges