
Empirical Investigations

Validation of a Detailed Scoring Checklist for Use During Advanced Cardiac Life Support Certification

McEvoy, Matthew D. MD; Smalley, Jeremy C. MD; Nietert, Paul J. PhD; Field, Larry C. MD; Furse, Cory M. MD; Blenko, John W. MD; Cobb, Benjamin G. MD; Walters, Jenna L. MD; Pendarvis, Allen MD; Dalal, Nishita S. MD; Schaefer, John J. III MD

Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare: August 2012 - Volume 7 - Issue 4 - p 222-235
doi: 10.1097/SIH.0b013e3182590b07


Defining valid, reliable, defensible, and generalizable standards for the evaluation of learner performance is a key issue in assessing both baseline competence and mastery in medical education.1 Regarding advanced cardiac life support (ACLS), the standards of performance are the published guidelines for patient management. Several studies have previously described the process by which an ACLS checklist can be validated and the process by which minimum passing scores for team leader performance can be determined for ACLS testing.2–5 Furthermore, the processes of checklist development, validation, and reliability testing have been described for other areas of medical training and assessment, including cardiac auscultation, objective structured clinical examinations, and anesthesia simulation scenarios.6–8 However, the literature to date leaves several important questions regarding the reliability of scores yielded from grading tools for ACLS testing unanswered.

First, are these checklists objective and simple enough to be used by nonexpert raters as opposed to faculty facilitators trained in grading simulation performances? Prior publications for some checklists described how professionals with training in simulation research and psychometric analysis tested the checklist being studied.9,10 However, the final end user of these published checklists will likely not have such training. Other publications have demonstrated that nonexpert raters, such as standardized patients or laypersons with no medical training, can use checklists to produce reliable scores that enable valid judgments of trainees.11–13 However, this has not been demonstrated with ACLS checklists. As such, ACLS checklists intended for use in simulation testing should undergo evaluation to ensure that a high level of reliability in scores is obtained when they are used by instructors who are not involved in checklist design and who have not received targeted instruction in simulation training. Second, if the checklists can be used by a nonexpert rater, is the training for proper checklist use of such duration that it could be given routinely to ACLS instructors? Third, is the checklist able to be used during real-time (ie, continuous) evaluation of the team leader managing an American Heart Association (AHA) Megacode, or does its use require additional time for video review after the competency examination? Accordingly, the purpose of this study was to assess the reliability of a checklist used as a continuous grading tool by nonexpert raters during the review of simulations of AHA Megacodes.


This study was submitted to our institutional review board and granted “exempt” status.


For the purposes of this article, the term experienced refers to individuals who have been involved with simulation education, training, and research for at least 3 years and who teach crisis management simulation courses, including ACLS, at least 5 times a year (M.D.M., L.C.F., C.M.F., and J.J.S.). The nonexpert raters in this study were 4 fourth-year medical students (B.G.C., J.L.W., A.P., and N.S.D.). These students had previously received ACLS training and certification via high-fidelity simulation and were familiar with the SimMan (SimMan 3G; Laerdal Medical Corp, Stavanger, Norway) software interface. However, none of the 4 nonexpert raters had received training in SimMan programming, checklist design, or simulation facilitation, grading, or debriefing. None of the 4 students had received training as an ACLS instructor.

Checklist Creation

The ACLS checklists used in this study were constructed following the steps outlined in previous checklist development studies, as well as those from authors experienced in checklist design.1–4,14,15 First, the checklist items were constructed through a detailed review of the “2005 American Heart Association Guidelines for Cardiopulmonary Resuscitation and Emergency Cardiovascular Care,” thus assuring content validity.16 Second, this content was evaluated by the group of faculty experts using a modified Delphi technique to determine the exact form of the items on the checklist.17 Third, the checklists were then divided into several portions, including the initial evaluation of the patient, items specific to the 3 patient conditions managed during the Megacode simulation [eg, pulseless electrical activity (PEA), stable tachycardia, ventricular fibrillation (VFIB)], and a section on common/possible errors. Fourth, 4 final checklists were constructed that would represent specific checklists to be used during the grading of Megacode scenarios programmed within the SimMan software (Appendices 1–4). These Megacode scenarios contained the following patient state sequences:

  1. Unstable bradycardia → VFIB → asystole,
  2. Stable tachycardia → pulseless ventricular tachycardia (PVT) → PEA,
  3. Unstable bradycardia → PVT → asystole, and
  4. Stable tachycardia → VFIB → PEA.

After the checklists were completed for content and order of the items, each item was then further categorized by the faculty experts in a number of different ways: correct/incorrect actions, objective/subjective checklist item, and critical/noncritical action. Items were first categorized as a correct or incorrect action. These varied for each checklist, depending on the patient’s state. For example, defibrillation is a correct step in the VFIB protocol, but an incorrect step in the PEA or asystole pathways. Each of the checklists consisted of 63 to 68 correct items and 32 to 34 incorrect items. Items were also classified as subjective or objective, depending on whether any interpretation was needed by the rater. The examples of objective items involved discrete actions, such as “defibrillated at 200 J biphasic” or “gave epinephrine 1 mg [intravenously].” The examples of subjective items often involved assessment and communication steps, such as “assessed reversible causes of arrest,” “assessed patient stability,” and “assigned team member roles.” These steps are more complex because “assessed reversible causes of arrest” could be done by reviewing them out loud with the team, thinking about them without verbalizing the condition in the differential, or a combination of the 2 methods. For example, the team leader may order a treatment for hyperkalemia without actually stating hyperkalemia in the differential. This can also be true for “assessed patient stability” because asking the patient certain questions is important in this step (eg, “Are you short of breath? Do you have chest pain?”), as is assessing the electrocardiogram tracing and the vital signs, and this complex assessment can be done in a variety of combinations. Thus, whereas rules were given during the training period to account for this variety in the subjective items, grading these items often required some subjectivity on the part of the rater. The variations in the number of items, the correct or incorrect nature of each item, the objective or subjective nature of each item, and whether the item was deemed “critical” can be seen in the full checklists shown in Appendices 1 to 4.

Creation of Videos Used for Rater Training and Checklist Analysis

Four training videos of simulated ACLS Megacodes were created for use in the checklist training curriculum for the nonexpert raters. One video was made for each of the 4 checklists (Appendices 1–4). After this, 8 ACLS Megacode simulation videos were created so that scores yielded from the checklists could be tested for reliability. Two videos were made for each checklist. The scenarios were all programmed within the SimMan software and executed using the SimMan 3G manikins. All scenarios were executed in our simulation laboratory in a patient care room that duplicates an inpatient hospital room at our institution and with a code cart replicating those in use in our hospital. The team leader performance during the videos was scripted to include both correct and incorrect actions, so that the raters would not be evaluating only expert performance. The team leaders in the training videos (n = 4) were ACLS-certified faculty members at our institution, and the team leaders in the testing videos (n = 8) were ACLS-certified resident volunteers in our anesthesiology training program. The code team members (ie, confederates) in both video sets consisted of trained simulation staff members who played the roles of 2 nurses performing cardiopulmonary resuscitation, a nurse managing the defibrillator, a pharmacist, and an airway manager. All of the videos were roughly 10 to 12 minutes. Video recording of the scenario management was performed using the B-Line system (SimCapture; B-Line Medical, LLC, Washington, DC).

After the creation of both sets of videos, 2 experienced faculty members (M.D.M. and J.J.S.) graded the performance in the videos using the checklists. This was done individually and then together to have a reference standard performance rating to which the evaluations of the nonexpert raters could be compared. Of note, the experienced raters provided very similar ratings of the video performances. The most problematic items were those considered subjective, such as assessed reversible causes of the arrest. For these few areas, the 2 faculty members watched the videos together and reached an agreement on the grading of the particular discrepant/discordant items. The reference standard was created so that the nonexpert raters would be compared not only to themselves and one another (intrarater and interrater reliability) but also to experienced simulation facilitators because one of the aims of this study was to evaluate whether nonexpert raters could use the checklist as accurately as faculty proficient in assessment and grading of simulation training and ACLS.

Training of Nonexpert Raters

The training of nonexpert raters consisted of several parts. First, they received instruction on the content of the checklists and the meaning of each item. Second, they were given instructions on how to grade items that were classified as subjective. Third, they were given a chance to ask questions about the checklist items and grading criteria. Fourth, the group was shown 1 of the 4 training videos. This video was played continuously and then with pauses to illustrate the 2 methods by which the Megacode performances were to be graded. That is, the nonexpert raters were to practice grading the team leader first in a “continuous” mode (ie, letting the video play continuously without the ability to stop or rewind and grading the team leader performance while viewing the video). Then, they were to practice grading the team leader performance “with pauses” (ie, having the freedom to pause, stop, and/or rewind the video during the grading process). This second method was allowed to make sure that the raters felt confident of their evaluation of every item on the checklist in relation to the performance of the team leader in the video.

After this group session, each nonexpert rater individually graded the 4 training videos, including the one that they had viewed as a group. Each video was graded once continuously and once with pauses. The raters were debriefed on their checklist evaluations individually and as a group. That is, 1 experienced facilitator (M.D.M.) would do a 1:1 debriefing with each nonexpert rater after the rater had completed a grading session. This facilitator clarified questions with respect to grading and discussed areas where videos were graded differently by the expert raters. This helped to clarify grading criteria for the set of 8 videos to be evaluated in the study. For instance, criteria were set for how to grade items such as “stated diagnosis of unstable bradycardia” because there were questions regarding how specific the wording must be from the team leader (ie, did the leader have to state “the diagnosis is unstable bradycardia,” or could various combinations of words be used as long as the leader indicated the correct diagnosis to the team?). Finally, the group met to discuss any final questions about the checklist and grading criteria.

The checklists were presented to the nonexpert raters in Microsoft Excel (Excel 2007; Microsoft, Inc, Seattle, WA) format for review during the training and study periods. They recorded their assessment of the team leader in this format during the training period and during the performance of the study. Their performance evaluations were entered directly into the Excel program on the computer. The Excel spreadsheet was designed to allow most of the checklist items to be seen at any one time (when displayed on a 19-in desktop monitor). More importantly, the items were arranged from top to bottom in a fashion that mirrored the design of the scenarios and thus in the order in which they would likely be assessed. For instance, 1 scenario started in stable tachycardia and then advanced to PVT and then to PEA. In the Excel spreadsheet, the items related to stable tachycardia would be the first set of items on the checklist, then those for PVT, and finally those for PEA (Appendices 1–4). All of the relevant checklist items for any patient state within the scenario (eg, VFIB) could be seen at one time by the rater. Thus, all items being rated at a given time were visually accessible to the rater on the computer screen without the need for scrolling. The total training time on the use of the checklist was approximately 4 hours, including the scoring of the 4 videos used for training on use of the checklists. All of the training was done in a single half-day session.

Checklist Evaluation

The nonexpert graders were then informed about the order in which to evaluate the 8 Megacode simulation videos. This video evaluation sequence was created by a random order generator and was different for each rater. The scenarios were presented to them in random order because each video was to be rated 4 times by each rater—twice in a continuous mode and twice with pauses (Table 1). The purpose of rating the videos in this manner was 2-fold. First, it was to assess whether there was greater reliability in scoring if the rater was able to pause and rewind the video to evaluate items that may have been missed during continuous evaluation (with pauses vs. continuous). Second, the videos were presented in random order to reduce repetition bias. Video evaluations by the nonexpert raters were compared with the expert video evaluation to test whether there was an improvement in agreement with repeated analysis of the videos. This was to test whether the training curriculum was sufficient or if there was an ongoing learning curve with continued use of the checklist. The video scoring was accomplished over 3 days. Day 1 included 2 half days in which the videos were reviewed and scored once in a continuous mode in the morning and then once in a continuous mode in the afternoon. Days 2 and 3 included reviewing and grading the videos with pauses once each day. The with pauses method of grading took about twice as long compared with the continuous mode.
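The per-rater randomization described above can be sketched in a few lines of Python. This is only an illustration, not the random order generator the authors used: the function name `make_viewing_orders`, the rater labels, and the seed are all hypothetical, and the rule that no two raters share an order is an assumption read from the statement that the sequence "was different for each rater."

```python
import math
import random


def make_viewing_orders(rater_ids, n_videos=8, seed=None):
    """Return a distinct random viewing order of the videos for each rater.

    Hypothetical helper: the study only states that a random order
    generator produced a different sequence for each rater.
    """
    if len(rater_ids) > math.factorial(n_videos):
        raise ValueError("more raters than distinct video orders")
    rng = random.Random(seed)  # seeded so a schedule can be reproduced
    videos = list(range(1, n_videos + 1))
    orders = {}
    used = set()
    for rater in rater_ids:
        # Redraw until this rater's order differs from every earlier one.
        while True:
            order = tuple(rng.sample(videos, k=len(videos)))
            if order not in used:
                break
        used.add(order)
        orders[rater] = list(order)
    return orders


if __name__ == "__main__":
    schedule = make_viewing_orders(
        ["rater_1", "rater_2", "rater_3", "rater_4"], seed=2012
    )
    for rater, order in schedule.items():
        print(rater, order)
```

In the study design, a fresh schedule of this kind would be needed for each of the 4 passes (2 continuous, 2 with pauses); whether the authors reused or regenerated orders between passes is not stated.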

Example of Grading Template for Novice Raters

Statistical Analysis

The interrater and intrarater reliability of the checklist scores was primarily assessed using a generalized linear mixed model (GLMM) approach (see Appendix 5 for a complete description). The first GLMM used individual checklist items as the unit of analysis. This model used binary dependent variables reflecting whether a given rater’s assessment agreed with the expert’s assessment. The GLMM included the following independent variables: scenario, continuous versus with pauses, round (1 or 2), whether the checklist item was a correct/incorrect action, whether the item was subjective/objective, sequential order of the scenario (1 to 7), whether the expert claimed that the item had been performed, and whether the experts viewed each item as critical. A random effect was included to account for the fact that within-rater assessments are likely correlated with one another. The GLMM used a logit link function and was conducted using SAS v9.2 (SAS, Cary, NC). This modeling process is somewhat analogous to the “G theory” approach for evaluating performance assessments on continuous assessment scores.18,19 We also used a separate GLMM to examine interrater and intrarater reliability across all items within a given ACLS Megacode scenario. In other words, we assessed the level of agreement between the raters and the experts across all checklist items (n = ∼100 per scenario, including possible incorrect actions) and modeled that value (using an identity link function) as a function of scoring method (continuous vs. with pauses) and grading round (1 or 2), again using random rater effects to control for the correlated nature of the data. We also summarized agreement using the κ statistic. Shrout and Fleiss20 suggest that κ higher than 0.75 is indicative of excellent agreement, κ between 0.40 and 0.75 is fair to good, and κ lower than 0.40 is considered poor.
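To make the item-level agreement measures concrete, the following minimal Python sketch computes raw percent agreement and a κ statistic for one rater against the reference standard, then applies the Shrout and Fleiss cutoffs quoted above. It assumes Cohen's two-rater κ on binary (performed/not performed) items, which the text does not specify, and the function names and example ratings are invented for illustration; they are not study data. It does not reproduce the GLMM analysis.

```python
def percent_agreement(rater, reference):
    """Fraction of checklist items on which the rater matches the reference."""
    if len(rater) != len(reference) or not rater:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater, reference)) / len(rater)


def cohen_kappa(rater, reference):
    """Cohen's kappa for two sets of binary (0/1) item ratings."""
    n = len(rater)
    p_observed = percent_agreement(rater, reference)
    p_rater = sum(rater) / n          # proportion the rater scored as performed
    p_reference = sum(reference) / n  # proportion the reference scored as performed
    # Agreement expected by chance if the two sets of ratings were independent.
    p_expected = p_rater * p_reference + (1 - p_rater) * (1 - p_reference)
    if p_expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (p_observed - p_expected) / (1 - p_expected)


def interpret_kappa(kappa):
    """Label a kappa value using the cutoffs cited in the text."""
    if kappa > 0.75:
        return "excellent"
    if kappa >= 0.40:
        return "fair to good"
    return "poor"


if __name__ == "__main__":
    # Made-up ratings for 10 checklist items (1 = action scored as performed).
    reference = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
    rater = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
    kappa = cohen_kappa(rater, reference)
    print(f"agreement = {percent_agreement(rater, reference):.2f}")
    print(f"kappa = {kappa:.2f} ({interpret_kappa(kappa)})")
```

In this toy example the rater agrees on 8 of 10 items, but because half of the agreement would be expected by chance, κ is about 0.6. This is why κ is reported alongside raw percent agreement.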

In addition to the measures of agreement for individual items within the checklist, we also measured Lin’s21 concordance correlation coefficient (CCC) and the intraclass correlation coefficient (ICC) for the code leader composite scores, as determined by the 4 raters and the reference standard.20,22 Separate composite measures were created for “correct” and “incorrect” actions performed by the code leader (as assessed by the rater). The composite scores for correct actions were determined by summing the total of all possible correct actions performed, divided by the total number of possible correct actions (ie, percent correct), whereas the composite scores for incorrect actions were determined by summing the total number of incorrect actions performed. The CCC ranges from −1.0 to 1.0 and reflects overall agreement between any 2 sets of scores. It is similar to the traditional Pearson correlation coefficient, but the CCC also includes a penalty for any systematic bias. We were able to use CCC values to assess test-retest reliability (comparing raters’ scores of the same scenarios on 2 different occasions) and to assess accuracy (ie, agreement between each rater and the reference standard). The ICC ranges from 0% to 100% and reflects the degree to which score variability is attributable to the code leader as opposed to interrater variability.
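The composite scores and the CCC described above can be illustrated with a short Python sketch. It assumes the standard form of Lin's coefficient, 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²) with 1/n moment estimates; the function names and the numbers are hypothetical and are not taken from the study. The ICC is omitted because the text does not state which ICC form was used.

```python
from statistics import fmean


def composite_scores(correct_performed, incorrect_performed):
    """Return (percent of possible correct actions performed, count of incorrect actions).

    Each argument is a list of 0/1 flags, one per checklist item of that type.
    """
    percent_correct = 100.0 * sum(correct_performed) / len(correct_performed)
    return percent_correct, sum(incorrect_performed)


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient for paired scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of at least 2 scores")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference penalizes systematic bias between the
    # two sets of scores, which Pearson's r ignores.
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("CCC is undefined when both sets of scores are constant and equal")
    return 2 * cov_xy / denominator


if __name__ == "__main__":
    # Made-up percent-correct composites for 3 videos.
    reference = [60.0, 70.0, 80.0]
    rater = [65.0, 75.0, 85.0]  # same ranking, but consistently 5 points higher
    print(f"CCC = {lin_ccc(rater, reference):.3f}")
    print(composite_scores([1, 1, 0, 1], [0, 1, 0]))
```

Here the rater's scores are perfectly correlated with the reference (Pearson r = 1), yet the CCC falls below 1 because of the constant 5-point offset, which is the systematic-bias penalty mentioned in the text.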

Under the assumption of low variability across raters in agreement with the reference standard (which was confirmed in our data), our sample of 4 raters assessed against a reference standard in grading 8 scenarios 4 times each (2 rounds in a continuous mode and 2 rounds with pauses) provided sufficient power (>80%) to determine if there was a greater than 2% interround difference between experienced and nonexpert raters, as well as between nonexpert raters, in the scoring of team leader actions during ACLS Megacode simulations.


An online supplement (see Table 1, Supplemental Digital Content 1) shows a complete listing of raw scores by rater, grading round, and scenario. On average, in the 8 videos used for this study, the team leaders performed correct actions 66.2% of the time (SD, 10.4%; range, 53.7%–82.1%) and performed 2.6 (SD, 2.0; range, 0–6) incorrect actions, as assessed by the reference standard. These performances reflect a range of poor to highly competent performance. Overall, the nonexpert raters agreed with the reference standard on individual checklist items 88.9% of the time. Table 2 shows the results of the primary GLMM model examining independent associations between experimental conditions (factors) with agreement/disagreement with the reference standard on individual checklist items. Agreement with the reference standard varied considerably across the 8 scenarios, with levels of agreement ranging from 83.3% to 91.43%. Agreement was significantly higher on items that were deemed to be incorrect actions versus correct actions, on items that were deemed to be objective versus subjective, and on critical items versus “not critical.” We observed significant (P = 0.03) variation across raters; however, the range of agreement only varied by 1.86 percentage points. No significant variation was observed when comparing the scoring methods, grading rounds, or actions deemed to be performed by the reference standard versus actions deemed not performed. Similar results (ie, significant variability across scenarios, significant but small variability among raters, and no significant differences comparing scoring methods or grading rounds) were observed when the total percent agreement score was used as the dependent variable in the GLMM.

Factors Associated With Agreement/Disagreement With the Reference Standard on Individual Checklist Items: Unadjusted Agreement Levels and P Values From the GLMM Model Reflecting Each Factor’s Independent Association With Agreement

In our GLMM model examining the existence of any potential learning curve over the 8 scenarios, we found that agreement between the raters and the reference standard did not significantly change from the early assessments to the later. After adjusting for scenario type, scoring method, and grading round, the estimated change in percent agreement from the first scenario to the eighth scenario was 0.2%. Figure 1 illustrates this lack of association, for example, among the nonexpert raters’ round 1 assessments in a continuous mode.

This figure shows the agreement with the reference standard by rater and by the order in which the video was graded during the first round of grading. The 8 videos were graded in different orders by each nonexpert rater. As this figure represents the agreement with the reference standard during the first round of continuous grading, it shows that no significant learning curve existed after the checklist and video training program was complete.

Details of the level of agreement between raters and the reference standard, as well as κ values summarizing agreement with the reference standard for rater responses on individual items, are provided as an online supplement (see Table 2, Supplemental Digital Content 2). In summary, the overall average κ between the raters and the reference standard was 0.78 across rounds, qualifying as excellent agreement according to Fleiss’s22 criteria. κ Values were relatively consistent across experimental conditions, ranging only from 0.74 to 0.80 for continuous grading for the first and second rounds (mean difference between rounds, 0.01; P = 0.29) and ranging from 0.74 to 0.81 for both rounds of the video grading with pauses (mean difference between rounds, 0.02; P = 0.17). These findings again suggest that the high level of interobserver reliability was not affected by scoring mode (continuous and with pauses). They also illustrate that the level of agreement did not substantially improve with repeated evaluation of the same tapes. There was no significant difference between these values for evaluations conducted in a continuous mode for round 1, a continuous mode in round 2, with pauses in round 1, and with pauses in round 2, wherein the average κ values only ranged from 0.77 to 0.78 (intrarater reliability). When κ values were averaged across rounds and mode of evaluation (continuous vs. with pauses), the variation in agreement between the raters and the reference standard was minimal. In this analysis, the average κ values ranged from 0.76 to 0.80, indicating high agreement with the reference standard overall.

For the composite scores for correct actions, test-retest reliability CCC values were high, averaging 0.96 across the 4 raters and ranging from 0.95 to 0.98. Concordance correlation coefficient values comparing raters’ scores to the reference standard were also quite strong, averaging 0.87 and ranging from 0.83 to 0.94. Across the 4 raters and the reference standard, the ICC was 0.97 for the composite scores for correct actions, indicating extremely low interrater variability. For the composite scores for incorrect actions, test-retest reliability CCC values were moderately high, averaging 0.83 across the 4 raters and ranging from 0.79 to 0.86. Concordance correlation coefficient values comparing raters’ composite scores for incorrect actions to the reference standard were moderately strong, averaging 0.59 and ranging from 0.54 to 0.69. Across the 4 raters and the reference standard, the ICC was 0.92, again indicating low interrater variability.

As noted previously, grading with pauses took about twice as long as continuous grading. We did not record the frequency with which raters changed their initial answers when using the mode with pauses. However, they reported that this occurred at a very low rate.


This study was undertaken to assess the reliability of scores generated from ACLS checklists used as grading tools by nonexpert raters during the review of simulations of AHA Megacodes. Previous studies have shown that ACLS checklists can be used to make valid judgments for the purpose of developing minimum passing score criteria in competency testing.2–5 However, it was noted previously that these studies left several questions unanswered if such checklists are to be adopted for general use in ACLS training. The results of the present study illustrate that we have developed a set of checklists that yield reliable scores and that can be used by nonexpert raters with little variability in scores across raters and with a high level of agreement between nonexpert and expert raters. The fact that nonexpert, non-ACLS–certified raters can produce reliable scores using these detailed checklists suggests that these checklists could produce reliable scores for AHA-certified ACLS instructors who would not be considered experts in education or research. Further studies will be needed to elucidate whether ACLS instructors who are already receiving instruction on the use of the AHA checklists could be taught to use this checklist in a similar amount of time. Taken together, these findings suggest that the checklists described in this study could be generalizable for widespread use in ACLS training without substantially increasing the training needed for course instructors with respect to the evaluation of team leader performance during Megacode management. Overall, this study has produced reliability evidence, content validity evidence, and some response-process evidence (raters are able to do this both with and without stops) for the checklists under consideration. In the future, instruments that yield reliable data and enable valid judgments of performance need to be developed for other skills addressed during ACLS courses, such as intubation.

The results of this study also show that a short training curriculum given to the nonexpert raters was sufficient to properly instruct them in the use of the checklist because no improvement in reliability of scores occurred with repeated use of the checklists during the testing phase (Fig. 1). It should be noted that the level of agreement between the raters and the reference standard was higher for objective questions compared with subjective questions. This was a result of the range of communication possibilities of some items that make scoring these items as a binary “yes/no” more difficult. These checklist elements may never reach the same level of reliability in scoring as the measurement of discrete, objective actions. It is well recognized in the anesthesia literature that assessing nontechnical skills, such as situational awareness and team communication, presents a unique challenge.23 Until discrete communication language is considered a requirement for correct Megacode management, such as clearly stating the precise words, “the patient is in x condition” or “let’s evaluate all of the reversible causes of the arrest,” the assessment of team leader communication and patient assessment may retain some element of subjectivity. Therefore, the nature of the checklist item could influence the grading performance for a rater because objective items are less prone to variation in evaluation. Thus, it is possible that this is an area in which training in the use of the checklists could be improved. Additional research will be needed to complete the validity assessment of these checklists, so that they can be used in summative assessments for high-stakes testing.

Additional work needs to be performed to determine the validity of these checklists and to set standards for passing. Methods of setting minimum passing scores in ACLS have been previously described.5 Furthermore, prior studies have shown a link between simulation training and improved adherence to ACLS guidelines in the clinical setting.24 However, this prior work did not show that there was improvement in patient outcome. This could be because, even in the group that received high-fidelity simulation (vs. standard teaching), only 68% of actions taken adhered to published guidelines. Yet, studies continue to show that adherence to guidelines does affect patient outcomes.25,26 Therefore, future work will need to address the best methodologies for ensuring a high rate of adherence to guidelines. In addition, this granularity in testing could be performed every few months, rather than at random, to ensure that providers who attend in-hospital cardiac arrests have demonstrated an ability to lead such an event properly.27

Our results demonstrate a high level of agreement between raters and show that overall interrater reliability was not affected by scoring mode. In addition, we demonstrated high intrarater reliability and excellent agreement between nonexpert and expert rater evaluations. However, incorrect actions were scored with less reliability. The level of reliability in scoring incorrect items was still in the range of what is considered “good.” However, this may remain a weakness of our checklist because the correct actions were all arranged in a logical, temporal sequence, and thus, it was probably easier for the rater to follow and score these items. On the other hand, the incorrect actions do not follow a logical sequence, and it is impossible to place them in a predetermined order.

The major limitations of our findings have been discussed previously. However, 3 further points deserve mention. First, it should be noted that these checklists were tested only for team leader evaluation, not the team as a whole. Second, the nonexpert raters recorded their assessment of team leader performance into an Excel spreadsheet while watching a video recording on a computer screen. This may be a much simpler task (keeping the head still and looking between 2 computer screens) than watching a room of several participants in continuous action while also looking up and down at a computer interface to score their performance. Operating the simulator user interface and simultaneously recording participant actions on an Excel spreadsheet are impractical. Therefore, if the content of these checklists is to be used in a real-time (ie, continuous) manner, future research will need to be performed by testing them within a simulator software interface (Fig. 2). Third, the checklists used in this study were based on the 2005 AHA ACLS update. However, the only change in the 2010 update that would affect the checklists would be the removal of atropine administration from the asystole and bradycardic PEA pathways. These changes would not substantively change the reliability assessment in this study, and the checklists could still be applicable for use under the new guidelines.

Figure 2. Example screenshot from the SimMan software interface, which will need to be tested in the future to determine whether these checklists can be used reliably in this format while simulations are actually occurring.


Our findings advance the knowledge in this arena of educational research in at least 4 key aspects. First, we have produced reliable checklists that yield reliable scores for evaluating correct and incorrect team leader performance during the review of simulated ACLS Megacodes. Second, faculty members experienced in checklist design and ACLS instruction are not needed for the use of these checklists. Third, a short training curriculum on the proper use of the checklist is effective. Fourth, by extension of these first 3 points, these checklists are likely to be appropriate for use in continuous grading of team leader performance during ACLS Megacode skills testing. Future studies need to address the feasibility of using these checklists within a simulator interface during live simulations and the generalizability of these checklists for widespread use in ACLS certifications. These 2 problems would best be addressed by a multisite study where our checklists are used by ACLS instructors during Megacode testing and compared with the grading checklists currently provided by the AHA. Overall, the results of this study indicate that 1 continuous evaluation (by a nonexpert rater) of team leader performance during the review of an ACLS Megacode simulation is adequate to achieve an accurate rating when using this checklist.


The authors thank Drs J. G. Reves and S. T. Reeves for their assistance in manuscript review. The authors thank all of the staff at the MUSC Clinical Effectiveness and Patient Safety Center for their assistance during this project.


1. Wayne DB, Butter J, Cohen ER, McGaghie WC. Setting defensible standards for cardiac auscultation skills in medical students. Acad Med 2009; 84: S94–S96.
2. Wayne DB, Siddall VJ, Butter J, et al. A longitudinal study of internal medicine residents’ retention of advanced cardiac life support skills. Acad Med 2006; 81: S9–S12.
3. Wayne DB, Butter J, Siddall VJ, et al. Graduating internal medicine residents’ self-assessment and performance of advanced cardiac life support skills. Med Teach 2006; 28: 365–369.
4. Wayne DB, Butter J, Siddall VJ, et al. Mastery learning of advanced cardiac life support skills by internal medicine residents using simulation technology and deliberate practice. J Gen Intern Med 2006; 21: 251–256.
5. Wayne DB, Fudala MJ, Butter J, et al. Comparison of two standard-setting methods for advanced cardiac life support training. Acad Med 2005; 80: S63–S66.
6. Hatala R, Scalese RJ, Cole G, Bacchus M, Kassen B, Issenberg SB. Development and validation of a cardiac findings checklist for use with simulator-based assessments of cardiac physical examination competence. Simul Healthc 2009; 4: 17–21.
7. Tudiver F, Rose D, Banks B, Pfortmiller D. Reliability and validity testing of an evidence-based medicine OSCE station. Fam Med 2009; 41: 89–91.
8. Morgan PJ, Cleave-Hogg D, Guest CB. A comparison of global rating and checklist scores from an undergraduate assessment using an anesthesia simulator. Acad Med 2001; 76: 1053–1055.
9. Müller MJ, Dragicevic A. Standardized rater training for the Hamilton Depression Rating Scale (HAMD-17) in psychiatric novices. J Affect Disord 2003; 77: 65–69.
10. Sevdalis N, Lyons M, Healey AN, Undre S, Darzi A, Vincent CA. Observational teamwork assessment for surgery: construct validation with expert versus non-expert raters. Ann Surg 2009; 249: 1047–1051.
11. Weidner AC, Gimpel JR, Boulet JR, Solomon M. Using standardized patients to assess the communication skills of physicians for the Comprehensive Osteopathic Medical Licensing Examination (COMLEX) level 2–performance evaluation. Teach Learn Med 2010; 22: 8–15.
12. Schmitz CC, Chipman JG, Luxenberg MG, Beilman GJ. Professionalism and communication in the intensive care unit: reliability and validity of a simulated family conference. Simul Healthc 2008; 3: 224–238.
13. Zanetti M, Keller L, Mazor K, et al. Using standardized patients to assess professionalism: a generalizability study. Teach Learn Med 2010; 22: 274–279.
14. Available at: Accessed February 3, 2011.
15. Stufflebeam DL. Guidelines for Checklist Development and Assessment. Available at: Accessed February 3, 2011.
16. ECC Committee, Subcommittees and Task Forces of the American Heart Association. 2005 American Heart Association Guidelines for Cardiopulmonary Resuscitation and Emergency Cardiovascular Care. Circulation 2005; 112: IV1–IV203.
17. Morgan PJ, Lam-McCulloch J, Herold-McIlroy J, Tarshis J. Simulation performance checklist generation using the Delphi technique. Can J Anaesth 2007; 54: 992–997.
18. Cronbach LJ, Rajaratnam N, Gleser GC. Theory of generalizability: a liberation of reliability theory. Br J Stat Psychol 1963; 16: 137–163.
19. Shavelson RJ, Webb NM, Rowley GL. Generalizability theory. Am Psychol 1989; 44: 922–932.
20. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420–428.
21. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45: 255–268.
22. Fleiss JL. Statistical Methods for Rates and Proportions. 2nd ed. New York, NY: Wiley-Interscience; 1981.
23. Fletcher G, Flin R, McGeorge P, Glavin R, Maran N, Patey R. Anaesthetists’ Non-Technical Skills (ANTS): evaluation of a behavioural marker system. Br J Anaesth 2003; 90: 580–588.
24. Wayne DB, Didwania A, Feinglass J, Fudala MJ, Barsuk JH, McGaghie WC. Simulation-based education improves quality of care during cardiac arrest team responses at an academic teaching hospital: a case-control study. Chest 2008; 133: 56–61.
25. Chan PS, Krumholz HM, Nichol G, Nallamothu BK. Delayed time to defibrillation after in-hospital cardiac arrest. N Engl J Med 2008; 358: 9–17.
26. Myhre JM, Ramachandran SK, Kheterpal S, et al. Delayed time to defibrillation after intraoperative and periprocedural cardiac arrest. Anesthesiology 2010; 113: 782–793.
27. Andreatta P, Saxton E, Thompson M, Annich G. Simulation-based mock codes significantly correlate with improved pediatric patient cardiopulmonary arrest survival rates. Pediatr Crit Care Med 2011; 12: 33–38.










There are several statistical methods for assessing the impact of various factors on the reliability of a given measurement. Generalizability theory (or G theory), introduced in 1963 by Cronbach et al (Br J Stat Psychol 1963;16:137–163), is particularly useful for examining the impact of multiple sources of variation. G theory uses an analysis of variance (ANOVA) approach to quantify the degree to which individual factors (eg, raters, time, experimental setting) affect measurement reliability. Although this type of approach is ideal for continuous measurements, some adaptations are necessary for binary observations, such as the individual checklist item responses summarized in the current article. To study the impact of item characteristics on interrater and intrarater reliability of the scoring checklist, we used a generalized linear mixed model (GLMM) approach with a binary response and a logit link function. The GLMM we used is similar to a logistic regression model; however, the traditional logistic regression model assumes independent errors across observations, whereas our GLMM allowed for correlated errors within observations made by the same rater. Our model can be expressed as:
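
A generic form consistent with the symbol definitions that follow can be sketched as below. The covariate placeholders x_{kij} and the count K are assumptions introduced for illustration; they stand in for the fixed-effect variables described in Table 1, which are not enumerated here.

$$
\operatorname{logit}(p_{ij}) = \ln\!\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \alpha + \sum_{k=1}^{K} \beta_k x_{kij} + \gamma_j + \epsilon_{ij}
$$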

In this model, pij represents the probability of agreement with the reference standard on item i by rater j, α is the estimated overall intercept, the β’s are estimated regression parameters for the specified fixed-effect variables of interest (see descriptors in Table 1), γj is a random intercept for rater j, and εij is an error term for the ith item as assessed by the jth rater. In this GLMM, we assumed that variation in agreement with the reference standard had a compound symmetry error structure, meaning that assessments made by the same rater were assumed to be correlated with one another and assessments made by different raters were assumed to be independent of one another. We constructed our GLMM using PROC GLIMMIX in SAS v9.2 (SAS Institute, Cary, NC).
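
To make the role of the random rater intercept concrete, the following minimal Python sketch simulates binary agreement outcomes from a random-intercept logistic model of the kind described above. It is an illustration only, not the analysis code used in the study (which was run in SAS). The function names, the single "round" covariate, and all parameter values are hypothetical choices made for this example.

```python
import math
import random


def inv_logit(eta: float) -> float:
    """Map a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_agreement(n_raters, n_items, alpha, beta_round, sigma_rater, seed=0):
    """Simulate agreement with a reference standard (1 = agree, 0 = disagree).

    Assumed model (hypothetical, for illustration):
        logit(p_ij) = alpha + beta_round * round + gamma_j
        gamma_j ~ Normal(0, sigma_rater^2), one draw per rater
    Returns a list of (rater, item, round, agree) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(n_raters):
        # The shared intercept gamma_j is what correlates one rater's scores.
        gamma_j = rng.gauss(0.0, sigma_rater)
        for rnd in (0, 1):  # 0 = round 1, 1 = round 2
            p = inv_logit(alpha + beta_round * rnd + gamma_j)
            for i in range(n_items):
                agree = 1 if rng.random() < p else 0
                rows.append((j, i, rnd, agree))
    return rows


def agreement_by_rater(rows):
    """Proportion of items on which each rater agreed with the reference."""
    totals = {}
    for rater, _item, _rnd, agree in rows:
        hits, n = totals.get(rater, (0, 0))
        totals[rater] = (hits + agree, n + 1)
    return {rater: hits / n for rater, (hits, n) in totals.items()}


# Hypothetical settings: 4 raters, 50 items, no round effect.
rows = simulate_agreement(n_raters=4, n_items=50, alpha=2.0,
                          beta_round=0.0, sigma_rater=0.5, seed=1)
rates = agreement_by_rater(rows)
```

Setting `sigma_rater` to zero in this sketch removes the rater-level variation, so every rater shares the same underlying agreement probability; larger values spread the per-rater agreement rates apart, which is the pattern a nonzero rater variance component would capture.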

The GLMM is an ideal model for simultaneously studying a number of sources of variation in raters’ agreement with the reference standard. By being able to incorporate random rater effects rather than fixed rater effects (as would be the case using an ANOVA approach), inference about reliability can be made about a population of raters, not just the specific raters who provided assessments. In other words, the GLMM approach provides greater generalizability than the ANOVA approach. In addition, a likelihood ratio test (using the COVTEST statement) can be used to test whether the variance component associated with the random rater effect is zero, providing a means for formally assessing interrater reliability. Because the “round” variable distinguishes whether a rater’s assessments were made during round 1 versus round 2, its parameter estimate and associated SE provide a means for assessing intrarater reliability over time. Because other factors (eg, questionnaire item characteristics, rater’s viewing order) may also contribute to the rater’s agreement with the reference standard, they are easily added to the model as fixed effects, and F tests based on the associated parameters and SEs are conducted by default. Working with GLMMs generally requires knowledge of general linear models (eg, linear regression, ANOVA) but also requires additional knowledge of more advanced statistical topics such as link functions, error covariance structures, and numerical quadrature; thus, collaboration with a biostatistician is recommended. McCulloch and Searle (Generalized, Linear, and Mixed Models, 2001) provide an excellent discussion of these topics.
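
As a rough illustration of the likelihood ratio test for a zero rater variance, the sketch below computes a P value from two log-likelihoods: one from a model without the random rater effect and one from the model with it. It assumes the commonly used reference distribution for a single variance component tested on the boundary of its parameter space, a 50:50 mixture of chi-square distributions with 0 and 1 degrees of freedom. The function name and the input values are hypothetical, and the sketch is not a reproduction of the SAS output reported in the study.

```python
import math


def boundary_lrt_pvalue(loglik_null: float, loglik_full: float) -> float:
    """P value for H0: rater variance component = 0.

    Assumes a 50:50 mixture of chi-square(0 df) and chi-square(1 df) as the
    reference distribution (an assumption of this sketch).
    """
    # Likelihood ratio statistic; truncated at zero to guard against
    # small negative values caused by numerical error.
    stat = max(0.0, 2.0 * (loglik_full - loglik_null))
    if stat == 0.0:
        return 1.0
    # The upper tail of a chi-square with 1 df equals erfc(sqrt(stat / 2));
    # the mixture halves that tail probability.
    return 0.5 * math.erfc(math.sqrt(stat / 2.0))


# Hypothetical log-likelihoods, for illustration only.
p_value = boundary_lrt_pvalue(loglik_null=-512.4, loglik_full=-509.1)
```

A small P value from such a test would suggest that raters differ systematically in their agreement with the reference standard; a large one is consistent with negligible between-rater variation.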


Simulation; Education; Checklist; Reliability; ACLS


© 2012 Society for Simulation in Healthcare