PERLC Evaluation Working Group Design
The Centers for Disease Control and Prevention provides funding for 14 Preparedness and Emergency Response Learning Centers (PERLCs), which form a PERLC network across the United States. The PERLCs train state, local, and tribal public health authorities within self-defined service areas, meeting partners' workforce development needs in public health preparedness and response through specialized training, education, and consultation. Using Donald Kirkpatrick's training evaluation model,1 the PERLC network established 4 evaluation working groups, one for each level of the model. Group membership was drawn from PERLC evaluation expertise across the nation, representing many disciplines of evaluation (eg, education, industrial/organizational psychology, public health) and spanning decades of practice and academic experience. Participants self-selected into specific Kirkpatrick-level groups according to self-identified areas of expertise. Over the past several years, PERLC evaluators have adopted the basic schema of Kirkpatrick's 4-level model and applied its structure to a wide variety of trainings integral to the PERLC mission.
The primary purpose of the working groups was to inform and promote center-level and program-level evaluation across the PERLC network; identify common training evaluation methods and measures; and share materials, resources, and lessons learned with state, local, and tribal public health organizations for potential replication. Utilizing an overarching evaluation model across the PERLC network has proved valuable, as it highlights the importance of working within a network and the overall impact of the PERLC program.
Kirkpatrick Model for Training Evaluation
The evaluation of education and training, irrespective of its modality (eg, in-person, online, webinars, seminars, symposia), can be accomplished using Kirkpatrick's 4-level taxonomy.1 The 4 levels aim to measure the following: (1) reaction, or the measure of how trainees feel about aspects of the training program; (2) learning, or the measure of knowledge acquired, skills improved, and/or attitudes changed because of training; (3) behavior change, or the measure of the extent to which trainees practice behaviors as a result of the training; and (4) results, or the measure of impact.
Level 1 is frequently assessed with summary counts of trainees and satisfaction surveys. Level 2 is typically evaluated with pre- and posttests specific to the training content and with trainee perceptions of increased knowledge and skills; in some cases, however, only posttests are feasible. As discussed in more detail in this article, level 2 is often measured retrospectively by asking trainees to self-assess perceived knowledge gain. Level 3 assessment tools rely on self-reported changes in practice behaviors as well as supervisor observations of change that can be attributed to the training. Level 3 measures are often challenging to develop and implement because they assess changes after the fact. Finally, level 4 assessments typically document organizational changes attributed to the training. Because level 4 assessments address higher-level, organizational, or systemwide changes, they are the most complicated and resource intensive to conduct.
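The retrospective self-assessment approach mentioned above can be scored very simply. The sketch below uses hypothetical ratings on an assumed 5-point self-rating scale; trainees rate their knowledge "before" and "after" on a single form at the end of training, and the perceived gain is the difference.

```python
# Sketch: scoring a retrospective pre/post self-assessment.
# All ratings are hypothetical; a 5-point self-rating scale is assumed
# (1 = no knowledge ... 5 = expert knowledge).
from statistics import mean

# Each record holds one trainee's retrospective "before" rating and
# current "after" rating, both collected posttraining.
ratings = [
    {"before": 2, "after": 4},
    {"before": 3, "after": 4},
    {"before": 1, "after": 3},
    {"before": 2, "after": 5},
]

# Perceived gain per trainee, then averaged across the cohort.
perceived_gains = [r["after"] - r["before"] for r in ratings]
print(f"mean perceived gain: {mean(perceived_gains):.2f} scale points")
```

Because both ratings come from the same posttraining form, this design avoids the response-shift bias of a true pretest but remains a subjective (level 1-adjacent) measure rather than objective knowledge gain.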
The following is an overview of the 4 levels as developed by Kirkpatrick, as well as a crosswalk of how these levels relate to emergency preparedness training, drills, and exercises.
Level 1: Reaction
Introduction to level 1
The measures in level 1 consist of trainee reaction surveys to capture perceptions of training. The fundamental tenet of level 1 is that if trainees are not satisfied with training or do not feel that the training is useful to them, they will not be motivated to learn or be able to utilize the training content. Level 1 serves as a necessary, but incomplete, step in the learning process. If failure occurs in level 1, it will ultimately result in an unsuccessful training program. Furthermore, dissatisfaction with existing training can lead to reduced enrollment in future courses, hindering potential for success even before the next training begins.
For the PERLC network, level 1 is primarily focused on satisfaction and goodness of fit with training provided by the PERLCs. However, since the PERLCs provide training using a variety of delivery modalities (eg, online, in-person, written, audio), with varying durations (ie, ranging from 1-hour webinars to multiday workshops), and differing methodologies (eg, didactic training, exercises), it is challenging to develop a common set of measures that would fit all PERLC training activities and could be used to generate aggregate data to share with program funders and stakeholders.
In 2011, the PERLC Evaluation Working Group developed 5 common metrics, since used to assess all PERLC training efforts, that address the following constructs for trainees who have completed training: course satisfaction, perceived knowledge gain, relevance of subject matter to response role, relevance of subject matter to daily job role, and recommendation to others (see Table for items and response scales). Below is a brief description and rationale for each of the 5 constructs.
- Course satisfaction: General satisfaction is an essential core measure of the extent to which a trainee likes or approves of a course.
- Perceived knowledge gain: While it may not be thought of as a traditional measure of satisfaction, trainees' perception of learning is a facet within level 1. It is important to note that there is a substantial difference between actual knowledge gain (ie, Kirkpatrick's level 2) and perceived knowledge gain. That is, knowledge gain is an objective measure of learning, typified by pre- and posttests of factual information or performance of a specific skill. Conversely, perceived knowledge gain is a more complex construct that includes thoughts about utility, appropriateness, and level of difficulty, as well as self-perceptions of learning. As a subjective measure, perceived knowledge gain is highly susceptible to halo effects (ie, if trainees like the training, then all of their ratings are likely to be higher than they would be otherwise).
- Relevance of subject matter to response role and job role: For this construct, relevance of subject matter was divided into 2 distinct items because public health emergency responders typically have 2 distinct sets of responsibilities. Few public health emergency responders are full-time emergency responders; therefore, most public health professionals have jobs they typically perform on a daily basis. In fact, many public health professionals never shift into their responder roles except for training and preparedness exercises. Accordingly, it is essential to separate these 2 roles into primary and response roles. As adult learners, public health professionals learn better when presented with materials that have an apparent practical use. If trainees feel that what they are learning will help them on the job, in either their primary or their response role, they will likely be more satisfied with training. Assessing relevance of training to work is an essential element of goodness of fit.2
- Recommendations to others: The final measure of level 1 satisfaction is perceived usefulness of the training for others. Perceived utility is a standard item that draws upon a high-level assessment of whether the training was worthwhile. Logically, if a trainee is willing to recommend a given training to others, then it can be assumed that the trainee judged the training to be worthy of his or her time and effort.
Reaction to training is typically evaluated using quantitative, Likert-scale questions or qualitative open-ended items.3 Questions can be administered through a paper and pencil instrument or online using a Web-based survey application (eg, SurveyMonkey, Qualtrics, LimeSurvey). Relevant to all levels of evaluation, keeping the Centers for Disease Control and Prevention's Six Steps and Relevant Standards in mind can provide a useful guide in terms of what level of evaluation should be pursued.4
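Once the 5 common items are administered on a shared scale, responses can be aggregated across trainings and centers. The sketch below is a minimal illustration with hypothetical item names and response data; it assumes a 5-point Likert scale (1 = strongly disagree ... 5 = strongly agree) and reports the mean rating and percent favorable (4 or 5) for each item.

```python
# Sketch: aggregating the 5 common level 1 metrics across respondents.
# Item keys and responses are hypothetical; a 5-point Likert scale is assumed.
from statistics import mean

ITEMS = [
    "course_satisfaction",
    "perceived_knowledge_gain",
    "relevance_response_role",
    "relevance_daily_job",
    "recommend_to_others",
]

def summarize(responses):
    """Return n, mean rating, and percent favorable (>= 4) per item."""
    summary = {}
    for item in ITEMS:
        values = [r[item] for r in responses if item in r]
        summary[item] = {
            "n": len(values),
            "mean": round(mean(values), 2),
            "pct_favorable": round(100 * sum(v >= 4 for v in values) / len(values), 1),
        }
    return summary

responses = [
    {"course_satisfaction": 5, "perceived_knowledge_gain": 4,
     "relevance_response_role": 4, "relevance_daily_job": 3,
     "recommend_to_others": 5},
    {"course_satisfaction": 4, "perceived_knowledge_gain": 4,
     "relevance_response_role": 5, "relevance_daily_job": 4,
     "recommend_to_others": 4},
]
print(summarize(responses)["course_satisfaction"])
```

Summaries of this form can be rolled up to generate the aggregate network-level data described above for funders and stakeholders.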
Level 2: Learning
Introduction to level 2
Kirkpatrick's level 2 evaluation is perhaps the most immediately recognizable evaluation format. Simply put, the purpose of level 2 is to evaluate the extent to which each trainee has learned from the training or exercise. Although it is simple and straightforward, evaluators typically consider the following as practical aspects of evaluation measures: what kind of tool will be used to evaluate (eg, a written or oral examination), the depth of knowledge expected (eg, Bloom's taxonomy for learning),5,6 the mode of evaluation (eg, paper or audience response software), and the timing of the evaluation (eg, same day as the training or after a series of training events). In this section, we describe the methodology by which level 2 evaluation is developed and applied to PERLC training.
Level 2 evaluation of PERLC activities should be linked to the 17 competencies (and 172 subcompetencies) within the 4 core domains, all stratified by knowledge, skills, and attitudes learning type,7 and/or Target Capabilities/Critical Tasks.5,6,8 However, because instructional design is outside the scope of this discussion, the focus will instead be on measuring knowledge specified in the learning objectives that typically accompany all courses, exercises, and so forth. The focus of level 2 is on measured knowledge gain.
Measuring knowledge gain
Knowledge gain is most accurately measured via pre- and postassessments of actual knowledge. This can be accomplished via multiple choice tests, scenario-based quizzes, discussion questions, and so forth. However, in many cases, pretests are not feasible due to cost, access to trainees, or the nature of the training (eg, full-scale exercises). In some instances, postassessments are the only viable option.
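When matched pre- and postassessments are feasible, knowledge gain can be summarized as a per-trainee gain score with a paired-samples effect size. The sketch below uses hypothetical percentage scores and assumes trainees are matched by an anonymous ID so pre and post records pair correctly.

```python
# Sketch: computing knowledge gain from matched pre/post test scores.
# Scores and trainee IDs are hypothetical percentages for illustration.
from statistics import mean, stdev

pre  = {"t01": 55, "t02": 60, "t03": 70, "t04": 45, "t05": 65}
post = {"t01": 75, "t02": 80, "t03": 85, "t04": 70, "t05": 70}

# Gain score for each trainee who completed both assessments.
gains = [post[t] - pre[t] for t in pre if t in post]

mean_gain = mean(gains)       # average point change across trainees
sd_gain = stdev(gains)        # spread of individual gains
d_z = mean_gain / sd_gain     # paired-samples standardized effect size

print(f"mean gain = {mean_gain:.1f} points, d_z = {d_z:.2f}")
```

Restricting the gain calculation to trainees present in both assessments handles the attrition that is common when posttests are administered separately from pretests.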
Limitations, barriers, and challenges to level 2 evaluation
Level 2 evaluation is commonly conducted in most training environments, and evaluators are typically familiar with effective techniques for evaluating knowledge gain.9 The primary barrier encountered by the PERLCs in conducting level 2 evaluation is the depth and rigor of the evaluation technique applied. In nearly every situation, the simplest form of level 2 evaluation is perception of knowledge gain. In fact, in Kirkpatrick and Kirkpatrick's discussion of the model,1 the case is made that perception of knowledge gain more properly belongs in level 1 as a measure of satisfaction. Accordingly, it can be a struggle, especially for short-duration trainings, to move evaluators and content developers to work together to develop objective, pre-post measures of knowledge gain.10 An additional challenge is coordinating efforts between the trainer and the evaluator. The larger the scope of the training program, the greater the number of trainers delivering material and trainings to be evaluated. Therefore, frame-of-reference training may be required to ensure standardization across trainings.11
Level 3: Behavior change
Introduction to level 3
Evaluation of behavior change in the field setting for public health preparedness and emergency response is based on having a sound knowledge of the training objectives, content, techniques used in the training, and findings from prior evaluation measures of levels 1 and 2.
Behavior change specifically refers to the transfer of knowledge, skills, capabilities, competencies, and so forth, from the training environment (eg, the in-person or online classroom) to the field setting. For the purpose of this discussion, a field setting is any applied environment. The continuum of behaviors includes work behaviors in everyday venues; behaviors that emerge during exercises; and/or behaviors that emerge during participation in a real event. The purpose of most public health emergency preparedness training is to prepare individuals for nonroutine emergencies, with the hope that responders will eventually feel as confident responding to nonroutine emergencies as to routine ones. For example, when weaponized anthrax was first identified as a potential public health threat, its discovery was a nonroutine emergency: there were no plans in place, and public health personnel were not trained to respond to anthrax. In the past decade, plans have been created, exercises conducted, and training performed, all of which have helped public health better prepare for a weaponized anthrax response. Accordingly, an anthrax event is now more routinized and less of a nonroutine emergency. From a funding and performance perspective, it is important to assess and document the longer-term benefits that trainees can demonstrate and/or describe as a result of their participation.
Guiding principles and other considerations for level 3
- Trainings vary in intensity and duration. Not all trainings are designed to produce behavior change (eg, focused on awareness). The degree of behavior change is likely to be associated with the intensity and duration of the training.
- Evaluating behavior change as part of a training program is encouraged. Training programs include multiple sessions over time where the same competency is repeatedly targeted. Repeated targeting of a competency over time is more likely to lead to behavior change.
- There should be an alignment between training and field setting objectives. The capabilities required in a real-world event should align with those taught and practiced during training.
- To evaluate individual behavior changes over time, the same individuals would need to be followed from the training environment to the field setting.
- There must be sufficient resources and continuity to follow individuals from the training environment to the field setting.
- Measures to assess individual achievement of competencies should align with the training objectives.
- Rigorous quantitative measures are preferred for determining desired changes (effect sizes). Barriers to using quantitative measures should be assessed in advance.
Limitations, barriers, and challenges to level 3 evaluation
Along with the aforementioned considerations that can introduce complications if not adequately addressed, there are a number of other elements that can limit the usefulness, accuracy, and affordability of longer-term evaluation. Careful consideration should be given to which measures are used to assess behavior over time. To be successful, specific metrics should be valid and reliable and have sufficient sensitivity to detect changes in behavior and competencies prior to and following training. Participant loss to follow-up may hamper obtaining rigorous results. In addition, many full-scale exercises and drills require intense preparation, require the use of trained experts in the field as well as skilled observers, and may extend over several days. The costs to cover the time, materials, and additional personnel who may be needed to substitute for those who are participating in the exercises can be exceedingly high.
Level 4: Results
Introduction to level 4
Kirkpatrick's level 4 focuses on systemic and organizational changes that result from trainees' performance. Historically, evaluating programs at level 4 of Kirkpatrick's model has been challenging for preparedness and emergency response training programs. Few programs assessed in the literature have attempted evaluation at this level. Among those that have, several common themes emerged:
- Consider the costs and benefits of attempting to measure level 4 results: While information from level 4 results may be useful for public health organizations, it is much more difficult to measure true effects of training because of the nature of systems level changes.
- Utilize a control group that does not receive the intervention: If an organization chooses to measure level 4 results from training directly, the strongest possible evaluation design would include a control group that did not undergo the training (an untreated control group design).
- Collect information on indicators both prior to and after the intervention: If a control group is not feasible, the next best design would assess outcomes both prior to and after the training without a control group.
- Collect data longitudinally: In reviews of the Kirkpatrick Evaluation Framework,1 authors repeatedly emphasized the need to allow time for developments to occur. Because systemic changes require more time for results to be observable, data collected immediately after training will not be as informative for assessing training outcomes.
- Utilize multiple quantitative and qualitative methods for collecting data: One theme that emerged from level 4 evaluation was the utilization of multiple types of data to assess outcomes. Data collection strategies in the literature included surveys, measurements of quantitative results (eg, revenue or waste reduction), interviews, focus groups, exercise results (tabletop, drill, and functional), external observation, modeling of projected outcomes, and expert panel discussions.
- Obtain data from multiple sources: Investigators also universally used data from multiple sources involved in the training program. In addition to data from trainees, information can also be obtained from program faculty, mentors, coaches, and organizational supervisors.
- Measure consistency between different sources of data: When multiple sources of data are used, quantitative methods for assessing consistency in responses can help evaluators ascertain the degree of agreement between sources.
It is problematic to correlate organization-level results directly to trainings because of the following characteristics: the amount of time that passes before results are seen, the number of potential confounding factors in an organizational setting, and the difficulty in measuring many of the “soft indicators” that may emerge from a particular training. Umble et al12 determined that measuring direct correlation of training with level 4 outcomes was best accomplished by creating evaluation instruments that were highly specific to the evaluated training.
Traditional level 4 evaluation has focused on the effects of training on an organization's bottom line. For emergency preparedness and response, we conceptualized the bottom line in terms of public organizations such as local health departments. As such, the bottom line must include not only costs and staff productivity measures but also the general functioning of the organization in relation to the surrounding community or jurisdiction for which it is responsible. Given the public environment in which local health departments and other emergency preparedness and response organizations operate, it is recommended that, at a minimum, the following items with suggested indicators be assessed for level 4: (1) staff-level indicators, such as stress level, job satisfaction, and educational attainment, which can also be aggregated to provide a general picture of how staff training has affected the organization; (2) organizational-level indicators, such as the time needed to identify an emergency and the time needed to respond to it; and (3) community-level indicators, such as the number and quality of partnerships as well as general community satisfaction.
Limitations, barriers, and challenges to level 4 evaluation
Level 4 evaluation is inherently difficult to conduct. It is resource intensive, difficult to develop, and often complex to carry out. However, as more examples of successful level 4 assessments emerge, the process will become better understood and more manageable for the next generation of evaluators.
As a group, many of the PERLC evaluators had conducted evaluation of public health preparedness and response training for a number of years. Although various models and approaches had been considered, none had been officially adopted. With the need for a mechanism to aggregate training outcomes to demonstrate, at a minimum, breadth of coverage, the PERLC Evaluation Working Group undertook the study and implementation of Kirkpatrick's model for evaluation of training for adult learners. The group divided itself into 4 subgroups to study the Kirkpatrick levels and developed best approaches for their implementation. While the intent was to develop a common set of items that evaluators could use for each level, it quickly became apparent that this could be done only with level 1 reaction items because of the diversity and variability of the PERLC trainings offered. As previously addressed, 5 common level 1 items were adopted for use in all preparedness training evaluations. A review of the other 3 levels reveals the following common themes and multifaceted evaluation approaches based on the Public Health Emergency Response (PHER) competencies and knowledge, skills, and attitudes:
- As a prerequisite, evaluators need a strong background in statistics and research methods.
- If success at level 1 is not achieved, the likelihood of measuring immediate learning and behavior changes declines.
- The more closely the evaluation assesses the objectives of the training, the greater the likelihood that training outcomes can be detected.
- Pre- and postdesigns in which the individual or organization serves as its own baseline provide a better sense of outcomes than do postdesigns alone. At a minimum, a control group that has not received the training would be a useful comparison for the intervention group; however, this is not possible in most situations because agencies aim to train all responders. To assess behavior change, it is advisable to conduct evaluations at a later point in time, allowing behaviors to evolve posttraining. However, the greater the time span between pre- and postassessment, the more difficult it becomes to detect training-related changes because of intervening variables.
- To evaluate behavior changes in the field as well as in organizations, being able to maintain contact with training participants is critical. Participant loss to follow-up may negatively impact the utility of the evaluation results.
- The use of robust quantitative measures is preferred to other measures to ensure that effect sizes can be determined. Likert items can be useful with all 4 Kirkpatrick levels. However, the integration of qualitative approaches (eg, open-ended items, focus groups, key informant interviews) may complement what has been learned through quantitative measures.
- Although evaluation at levels 2 through 4 can become increasingly expensive, keeping the Centers for Disease Control and Prevention's Six Steps and Relevant Standards in mind when approaching any level may be useful in assessing whether or not to move forward and with which type of approach.
As a working group, PERLC evaluators were able to map out approaches to all 4 levels of Kirkpatrick's training evaluation model. Recommendations can be applied by evaluators of preparedness and emergency response training and related activities and other areas of program evaluation.
1. Kirkpatrick DL, Kirkpatrick JD. Implementing the Four Levels: A Practical Guide for Effective Evaluation of Training Programs. San Francisco, CA: Berrett-Koehler; 2007.
2. Hites L, Altschuld J. Understanding quality: a guide for developers and consumers of public health emergency preparedness training. Public Health Rep. 2010;125(5):33–42.
3. Weng LJ. Impact of the number of response categories and anchor labels on coefficient alpha and test-retest reliability. Educ Psychol Meas. 2004;64(6):956–972.
4. Centers for Disease Control and Prevention. Framework for program evaluation in public health. MMWR Morb Mortal Wkly Rep. 1999;48(No. RR-11):1–41.
6. Bloom BS, Englehart MB, Furst EJ, Hill WH, Krathwohl DR. Taxonomy of Educational Objectives, the Classification of Educational Goals—Handbook 1: Cognitive Domain. New York, NY: McKay; 1956.
7. Association of Schools of Public Health, Centers for Disease Control and Prevention. Knowledge, Skills, and Attitudes (KSAs) for the public health preparedness and response core competency model. http://www.asph.org/userfiles/ksa.pdf. Accessed September 2012.
8. Skinner RL. A Review of the Top Officials 3 Exercise. Washington, DC: Department of Homeland Security, Office of the Inspector General, Office of Inspections and Special Reviews; 2005. OIG-06-07. http://www.oig.dhs.gov/assets/Mgmt/OIG_06-07_Nov05.pdf. Accessed October 1, 2013.
9. Skeff KM, Stratos GA, Bergen MR. Evaluation of a medical faculty development program: a comparison of traditional pre/post and retrospective pre/post self-assessment ratings. Eval Health Prof. 1992;15:350–366.
10. Chiaburu DS, Huang JL, Hutchins HM, Gardner RG. Trainees' perceived knowledge gain unrelated to the training domain: the joint action of impression management and motives. Int J Train Dev. 2013;18(1):1–16.
11. Woehr DJ. Understanding frame-of-reference training: the impact of training on the recall of performance information. J Appl Psychol. 1994;79(4):525.
12. Umble K, Baker EL, Diehl SJ, et al. An evaluation of the National Public Health Leadership Institute—1991-2006: part II. Strengthening public health leadership networks, systems, and infrastructure. J Public Health Manag Pract. 2011;17:214–224.