We observed nurse-to-nurse morning handoffs and multidisciplinary rounds in a 15-bed surgical ICU at a large academic hospital in the mid-Atlantic United States. We also reviewed videos of simulated codes from the same institution. These tasks were chosen because they represented both action-oriented (direct patient care, simulated codes) and transition-oriented (planning/establishing goals, handoffs, and rounds) team care. The observability of certain team behaviors is contingent upon the task being observed (21), and the selection of these specific tasks ensured that the reliability and validity of each teamwork competency could be assessed. Furthermore, rounds and handoffs were selected because they were routinely conducted around the same time every day, making it logistically easier for observers to capture performance. Finally, because there are few easily observable team action tasks in the ICU, retrospectively reviewing recorded simulated codes provided an accessible way to observe an unpredictable and sensitive task.
Two raters (A.S.D., M.A.R.) who helped develop the behavioral marker system and had expertise in behavioral measurement conducted all observations. Raters scored each team on subdimensions of teamwork germane to each task, which were informed by previous studies describing competencies expected for action-oriented and transition-oriented tasks (22, 23), other global competencies (24), and input from clinician team members. To practice using the system, raters observed six handoffs, eight multidisciplinary rounds, and 25 simulated codes; independently rated performance on each; and shared their ratings and rationales. Data collected from these practice observations were not included in the final analysis. Practice observations also afforded an opportunity to confirm which teamwork competencies manifested for a given task. For example, we found that although a general plan of care was established during multidisciplinary rounds, the explicit delineation of roles and responsibilities to achieve those goals occurred immediately after rounds.
Raters observed the same 138 instances of teamwork (25 nurse-to-nurse morning handoffs, 88 multidisciplinary rounds, and 25 simulated code exercises) and independently rated performance on the subdimension competencies for each task (illustrated in Appendix 3, Supplemental Digital Content 3, http://links.lww.com/CCM/D991). An instance of teamwork was defined by the interaction among team members for a specific patient for a given task. For example, patient rounds may involve an attending physician, fellow, resident, and a nurse assigned to a patient as well as ancillary staff such as pharmacists, respiratory therapists, and physical therapists. Because the composition of a team can vary across each of these tasks for each patient, a given team could be observed for only one task.
Behavioral Marker System Reliability and Validity
We assessed the reliability of the behavioral marker rating system in two ways. Interrater reliability was assessed to compare the consistency of scoring between raters (25). Intraclass correlation coefficients (ICCs) were calculated for each task overall and for each subdimension; both the single measure, comparing raters' agreement on a single score, and the average measure of the two scores were reported. A two-way random-effects model with absolute agreement was used to provide a conservative ICC estimate (27). An ICC less than 0.40 indicated poor reliability between raters, 0.40–0.59 fair reliability, 0.60–0.74 good reliability, and greater than or equal to 0.75 excellent reliability (26). SPSS v.21 (IBM, Armonk, NY) was used for this analysis.
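The ICCs above were computed in SPSS. For readers wanting to reproduce the calculation, a minimal NumPy sketch of the two-way random-effects, absolute-agreement ICC (single-measure and average-measure forms, often written ICC(2,1) and ICC(2,k)) might look like the following; the function name and example ratings are illustrative, not taken from the study data.

```python
import numpy as np

def icc_two_way_random_absolute(ratings):
    """Two-way random-effects ICC with absolute agreement.

    ratings: (n_subjects, k_raters) array of scores.
    Returns (single_measure, average_measure), i.e., ICC(2,1) and ICC(2,k).
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)   # per instance of teamwork
    col_means = ratings.mean(axis=0)   # per rater

    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    ss_err = np.sum(
        (ratings - row_means[:, None] - col_means[None, :] + grand) ** 2
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
    return single, average

# Hypothetical data: six instances of teamwork scored by two raters on a 1-5 scale
scores = np.array([[4, 4], [3, 2], [5, 5], [2, 3], [4, 3], [1, 2]])
single, average = icc_two_way_random_absolute(scores)
```

As with the values reported in Table 3, the average-measure ICC is higher than the single-measure ICC because it reflects the reliability of the mean of the two raters' scores.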
Generalizability theory was applied to further examine reliability and provide evidence of construct validity. We chose generalizability theory because it applies an analysis-of-variance framework to behavioral measurement data: it can partition multiple sources of systematic variance (both desirable and undesirable), allowing a better evaluation of the dependability of a measurement instrument. Traditional reliability testing cannot make this differentiation and treats all measurement error as random variation. Because generalizability theory assesses both desired and undesired variance, it represents a powerful methodologic approach for providing evidence of reliability and validity (28–30).
Generalizability theory has been previously applied in healthcare to validate measurement instruments (17, 29, 31). It is leveraged in the present study to estimate the variance in scores attributable to instances of teamwork observed, subdimensions, raters, tasks, and associated interactions (illustrated in Appendix 2, Supplemental Digital Content 3, http://links.lww.com/CCM/D991). A valid measurement system should demonstrate systematic differences in how subdimensions are scored across teams. Therefore, the following pattern of results is expected. The percent of variance associated with the subdimension by instance of teamwork interaction should be the greatest source of variance, followed by the main effects of subdimensions and instances of teamwork. The percent of variance associated with rater effects and task effects should be minimal because large values would represent inconsistent scoring across raters and tasks.
In the present study, analyses were performed using EduG v.6.1 (32), which enumerated each source of variance and calculated a generalizability coefficient. Generalizability coefficients estimated the amount of variance in observed scores attributable to desired sources of variance (e.g., differentiating performance and competencies) compared with undesired/unexpected variance (e.g., rater effects); higher coefficients indicate better measurement systems (33). Relative generalizability coefficients measured the extent to which the marker system could make comparative distinctions (team A performed better than team B), whereas absolute coefficients measured exact differences (team A averaged a 4.2 across subdimensions and team B averaged a 3.6 across subdimensions). A coefficient above 0.80 was used as our cutoff score (32) for acceptable dependability of the marker system, and the percent of variance in scoring attributes is reported for all analyses.
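The study's analyses were run in EduG across multiple facets. To make the relative/absolute distinction concrete, the sketch below estimates variance components and both coefficients for a simplified one-facet crossed design (instances of teamwork x raters) using standard expected mean squares. The function and data are hypothetical and intentionally omit the subdimension and task facets, so this is an illustration of the coefficients' logic, not the EduG computation itself.

```python
import numpy as np

def g_coefficients(scores):
    """Relative and absolute generalizability coefficients for a fully
    crossed one-facet design (instances of teamwork x raters).

    scores: (n_instances, n_raters) array for a single subdimension.
    Variance components come from the expected mean squares of the
    two-way random-effects ANOVA; negative estimates are truncated to zero.
    """
    scores = np.asarray(scores, dtype=float)
    n_p, n_r = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)
    r_means = scores.mean(axis=0)

    ms_p = n_r * np.sum((p_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((r_means - grand) ** 2) / (n_r - 1)
    ss_res = np.sum(
        (scores - p_means[:, None] - r_means[None, :] + grand) ** 2
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    var_res = ms_res                          # instance x rater interaction + error
    var_p = max((ms_p - ms_res) / n_r, 0.0)   # desired: true team differences
    var_r = max((ms_r - ms_res) / n_p, 0.0)   # undesired: rater leniency/severity

    # Relative: supports rank-order comparisons; absolute: exact-score decisions
    relative = var_p / (var_p + var_res / n_r)
    absolute = var_p / (var_p + var_r / n_r + var_res / n_r)
    return relative, absolute

# Hypothetical example: the second rater consistently scores one point higher
rel, ab = g_coefficients(np.array([[3, 4], [2, 3], [5, 6], [1, 2]]))
```

In this example, a consistent rater leniency effect leaves the relative coefficient high (rank order of teams is preserved) while lowering the absolute coefficient, mirroring the comparative-versus-exact distinction described above.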
We applied generalizability studies to examine the three tasks for four primary sources of systematic variance: instances of teamwork, rater effects, subdimension effects of the marker system, and task effects. A secondary analysis explored systematic effects associated with the attending physician leading rounds. A power analysis was not appropriate because generalizability studies are not based on hypothesis testing (34). However, the sample sizes reported for each analysis were consistent with previous research (17, 30, 35).
All measurement sources were treated as random for analysis to provide a more conservative account of study findings. We used a mixed design to estimate all possible variance in the dataset. Instances of teamwork were nested within the observed tasks (handoffs, rounds, simulated codes) because team composition varied each time a task was performed in the ICU. The nested design prevented estimation of some variance components because of confounding (e.g., the instance of teamwork by task interaction was indistinguishable from the main effect of task) (36). Additionally, we sought to examine any change in variance when the data for the tasks were combined compared with when each task was analyzed separately. Therefore, six separate generalizability studies were conducted to analyze the patterns of results.
Transition Team Tasks
Our analysis of transition tasks included four sources of variance: instances of teamwork nested in tasks, raters, and six subdimensions (from communication and team decision-making dimensions) relevant to both rounds and handoffs (“analysis 1”). Instances of teamwork for rounds were randomly sampled from a single attending for this analysis to avoid introducing potential confounding associated with leadership effects. We then conducted separate analyses for handoffs (“analysis 2”) and rounds data (“analysis 3”) to analyze for differences when the data for these two tasks were investigated separately. Next, we analyzed all rounding data to explore potential leader effects during rounds (“analysis 4”); 16 instances of teamwork were randomly sampled per leader (n = 4). The variance sources examined were instances of teamwork nested within the attending physicians leading rounds, raters, and subdimensions. In addition to the six subdimensions explored in analyses 1–3, we assessed team norms, and error correction and feedback.
Action Team Task
“Analysis 5” was a generalizability study for the action team task (codes) and included three sources of variance: instances of teamwork, raters, and subdimensions (style, content, and closed-loop communication), task management and delegation, and offering/seeking support.
Global Teamwork Competency
Every subdimension of communication manifested during each task. Therefore, we examined variance associated with this subdimension across each task (“analysis 6”). For this analysis, sources included instances of teamwork, raters, and subdimensions.
Rater scores for each overall task had good correlation for single score comparisons (ICC, 0.69 for rounds; ICC, 0.64 for handoffs; and ICC, 0.62 for simulated codes) and excellent correlation for averaged measures (ICC, 0.81 for rounds; ICC, 0.78 for handoffs; and ICC, 0.76 for simulated codes) (Table 3). Across all tasks, there were seven subdimensions indexed as fair and one as poor. Interrater reliability for communication style showed excellent correlation for rounds (ICC, 0.75 single measure; and ICC, 0.86 average measure) and poor-to-fair correlation for simulated codes (ICC, 0.38 single measure; and ICC, 0.55 average measure).
Validity Testing of Behavioral Marker System
The analysis of variance for each generalizability study is summarized in Appendix 2 (Supplemental Digital Content 3, http://links.lww.com/CCM/D991). Outside of the residual error term (unexplained variance), the interaction of subdimensions and instances of teamwork accounted for the largest proportion of variance for each analysis, meaning subdimensions were differentially scored by raters for each instance of teamwork. The subdimension and instance main effect generally accounted for the second largest source of variance. A notable exception was simulated codes (analysis 5), wherein the subdimension main effect only accounted for 5.8% of the total variance.
Variance due to overall rater effects never surpassed 0.7%, demonstrating minimal systematic differences in the raters' scores for instances of teamwork, subdimension competencies, and tasks (Appendix 2, Supplemental Digital Content 3, http://links.lww.com/CCM/D991). A relatively large variance (14%) was attributed to one rater systematically scoring some instances of teamwork (averaged over subdimensions) higher than the other rater for simulated codes (analysis 5).
Table 4 presents the generalizability coefficients. The marker system differentiated among subdimensions regardless of rater, task, or instances of teamwork that were observed across all analyses except simulated codes, which approached conventional criteria (0.76 relative, 0.66 absolute). When instances of teamwork were viewed as the only desired source of variance (i.e., were there overall differences in how teams performed regardless of subdimensions?), only analysis 4 approached conventional standards.
This study evaluated the reliability and validity of a behavioral marker system for assessing ICU team performance during multidisciplinary rounds, nurse-to-nurse handoffs, and simulated code events. We found the behavioral marker system to: 1) differentiate teamwork competencies and 2) reliably capture teamwork differences within a particular instance of teamwork. This means that raters were judging performance differently based on the competency they were observing during each instance of teamwork and that each competency represented a unique aspect of teamwork. This finding justifies the use of this tool to capture how a specific team is performing across a variety of competencies relevant to both action and transition tasks. This tool could be used in learning and development to determine where some teams are performing better than others. Furthermore, variance attributable to rater effects across all analyses was marginal.
The marker system did not reliably differentiate between high- and low-performing teams for handoffs and codes unless competencies were applied as a desired source of variation in the generalizability study analysis. Thus, the marker system will have greater utility for providing formative, rather than completely summative, evaluations or assessments.
Additionally, we found that about 30% of variance in each analysis was from residual (unexplained) error. The unexplained error could have stemmed from such factors as experience of team members, patient complexity, and task interruptions. For instance, complex patients generally require more resource and contingency planning, which could influence team behaviors. Future research would benefit from understanding the implications of these factors on the reliability of behavioral measurement.
Although interrater reliability ranged from good to excellent overall, there were seven cases where reliability was fair and one where it was poor. Low reliability values reduce the confidence that raters are consistently scoring the same attribute. They also may underscore the challenges associated with behavioral measurement. Teamwork in the ICU is complex, thereby complicating the rating of teamwork behaviors. To illustrate, a single statement from a clinician could involve behaviors related to updating and revising goals (e.g., the patient did respond to a certain treatment), planning and establishing goals (e.g., consults with outside services and/or additional tests are suggested), and contingency planning (e.g., there are no signs of active bleeding, but that is a situation in need of monitoring). Capturing all this information is a difficult undertaking for raters, especially when the behaviors occur in rapid succession or when the team has many members for raters to track. Costa et al (18), for example, found rating team behaviors in an ICU challenging for similar reasons.
Another key finding was that the observability of specific teamwork competencies varied across team performance contexts. Communication was consistently observed across tasks, but team decision-making was mostly observed in transition tasks and backup and supportive behavior in action tasks. We expected leadership to be globally relevant to both action- and transition-oriented team tasks, but leadership behaviors did not manifest during nurse-to-nurse handoffs.
There were limitations to our study. The marker system was used for both in-person and video-recorded observations, and these differences may have influenced our findings. However, code events are unplanned and infeasible to capture in real time, yet they represent a critical period when effective team performance is paramount for patient safety. Biases intrinsic to observational research may have influenced study findings. For instance, direct observation of clinician behavior may have altered that behavior. Raters may have been susceptible to the contrast effect (i.e., comparing a current instance to the previous instance for valuations rather than relying on the behavioral markers) (37). Logistical challenges inherent with care transitions and codes (e.g., opportunities/ability to observe instances of teamwork) limited our ability to have consistent sample sizes across tasks. Additionally, we only tested the marker system in one surgical ICU, and our results may not generalize to other types of ICUs or hospitals. Finally, definitions of teams, teamwork, and team competencies vary widely in the critical care literature (16) and healthcare more broadly (38). This diversity of terminology and conceptualizations limits the development of assessment tools that can be broadly implemented.
Teamwork skills are essential to provide safe and efficient care in the ICU. Measuring teamwork in critical care environments poses unique challenges, including highly diverse and dynamic team compositions, variability in physical and temporal distributions, and extreme variety in types of team tasks, ranging from highly cognitive and analytical tasks requiring collaborative problem solving to action-oriented procedural and physical tasks requiring behavioral coordination. Our findings support the validity of this tool and its utility for evaluating team performance for multiple task types.
We thank Chris Holzmueller (Armstrong Institute for Patient Safety and Quality, Johns Hopkins University School of Medicine) for her insightful feedback and contributions during the review process.
1. Pham JC, Aswani MS, Rosen M, et al. Reducing medical errors and adverse events. Annu Rev Med 2012; 63:447–463
2. Schmutz J, Manser T. Do team processes really have an effect on clinical performance? A systematic literature review. Br J Anaesth 2013; 110:529–544
3. Weaver SJ, Dy SM, Rosen MA. Team-training in healthcare: A narrative synthesis of the literature. BMJ Qual Saf 2014; 23:359–372
4. Hughes AM, Gregory ME, Joseph DL, et al. Saving lives: A meta-analysis of team training in healthcare. J Appl Psychol 2016; 101:1266–1304
5. Reader TW, Flin R, Mearns K, et al. Developing a team performance framework for the intensive care unit. Crit Care Med 2009; 37:1787–1793
6. Profit J, Sharek PJ, Amspoker AB, et al. Burnout in the NICU setting and its relation to safety culture. BMJ Qual Saf 2014; 23:806–813
7. Pronovost PJ, Thompson DA, Holzmueller CG, et al. Toward learning from patient safety reporting systems. J Crit Care 2006; 21:305–315
8. Salas E, DiazGranados D, Klein C, et al. Does team training improve team performance? A meta-analysis. Hum Factors 2008; 50:903–933
9. Neily J, Mills PD, Young-Xu Y, et al. Association between implementation of a medical team training program and surgical mortality. JAMA 2010; 304:1693–1700
11. Flin R, Martin L. Behavioral markers for crew resource management: A review of current practice. Int J Aviat Psychol 2001; 11:95–118
12. Russ S, Hull L, Rout S, et al. Observational teamwork assessment for surgery: Feasibility of clinical and nonclinical assessor calibration with short-term training. Ann Surg 2012; 255:804–809
13. Fletcher G, Flin R, McGeorge P, et al. Anaesthetists’ Non-Technical Skills (ANTS): Evaluation of a behavioural marker system. Br J Anaesth 2003; 90:580–588
14. Sevdalis N, Lyons M, Healey AN, et al. Observational teamwork assessment for surgery: Construct validation with expert versus novice raters. Ann Surg 2009; 249:1047–1051
15. Mitchell L, Flin R, Yule S, et al. Evaluation of the scrub practitioners’ list of intraoperative non-technical skills system. Int J Nurs Stud 2012; 49:201–211
16. Dietz AS, Pronovost PJ, Benson KN, et al. A systematic review of behavioural marker systems in healthcare: What do we know about their attributes, validity and application? BMJ Qual Saf 2014; 23:1031–1039
17. Weller J, Frengley R, Torrie J, et al. Evaluation of an instrument to measure teamwork in multidisciplinary critical care teams. BMJ Qual Saf 2011; 20:216–222
18. Costa DK, Dammeyer J, White M, et al. Interprofessional team interactions about complex care in the ICU: Pilot development of an observational rating tool. BMC Res Notes 2016; 9:408
19. O’Leary KJ, Boudreau YN, Creden AJ, et al. Assessment of teamwork during structured interdisciplinary rounds on medical units. J Hosp Med 2012; 7:679–683
20. Dietz AS, Pronovost PJ, Mendez-Tellez PA, et al. A systematic review of teamwork in the intensive care unit: What do we know about teamwork, team tasks, and improvement strategies? J Crit Care 2014; 29:908–914
21. Dietz AS, Rosen MA, Wyskiel R, et al. Development of a behavioral marker system to assess intensive care unit team performance. Proc Hum Factors Ergon Soc Annu Meet 2015; 59:991–995
22. Marks MA, Mathieu JE, Zaccaro SJ. A temporally based framework and taxonomy of team processes. Acad Manag Rev 2001; 26:356–376
23. LePine JA, Piccolo RF, Jackson CL, et al. A meta-analysis of teamwork processes: Tests of a multidimensional model and relationships with team effectiveness criteria. Pers Psychol 2008; 61:273–307
24. Cannon-Bowers JA, Tannenbaum SI, Salas E, et al. Defining competencies and establishing team training requirements. In: Guzzo RA, Salas E (Eds). Team Effectiveness and Decision Making in Organizations. San Francisco, CA, Jossey-Bass, 1995, pp 333–380
25. Tinsley HE, Weiss DJ. Interrater reliability and agreement of subjective judgments. J Couns Psychol 1975; 22:358–376
26. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess 1994; 6:284–290
27. Shrout PE, Fleiss JL. Intraclass correlations: Uses in assessing rater reliability. Psychol Bull 1979; 86:420–428
28. Brennan RL. Generalizability Theory. New York, NY, Springer, 2001
29. Crossley J, Marriott J, Purdie H, et al. Prospective observational study to evaluate NOTSS (Non-Technical Skills for Surgeons) for assessing trainees’ non-technical performance in the operating theatre. Br J Surg 2011; 98:1010–1020
30. Kraiger K, Teachout M. Generalizability theory as construct-related evidence of the validity of job performance ratings. Hum Perform 1990; 3:19–35
31. Moonen-van Loon JM, Overeem K, Govaerts MJ, et al. The reliability of multisource feedback in competency-based assessment programs: The effects of multiple occasions and assessor groups. Acad Med 2015; 90:1093–1099
32. Cardinet J, Johnson S, Pini G. Applying Generalizability Theory Using EduG. New York, NY, Routledge, 2011
33. Cardinet J, Tourneur W, Allal L. The symmetry of generalizability theory: Applications to educational measurement. J Educ Meas 1976; 13:119–135
34. Crossley J, Russell J, Jolly B, et al. ‘I’m pickin’ up good regressions’: The governance of generalisability analyses. Med Educ 2007; 41:926–934
35. Mathieu JE, Day DV. Assessing processes within and between organizational teams: A nuclear power plant example. In: Brannick MT, Salas E, Prince CW (Eds). Team Performance Assessment and Measurement: Theory, Methods, and Applications. Mahwah, NJ, Lawrence Erlbaum Associates, 1997, pp 173–195
37. Feldman M, Lazzara EH, Vanderbilt AA, et al. Rater training to support high-stakes simulation-based assessments. J Contin Educ Health Prof 2012; 32:279–286
38. Rosen MA, DiazGranados D, Dietz AS, et al. Teamwork in healthcare: Key discoveries enabling safer, high-quality care. Am Psychol 2018; 73:433–450
Keywords: group processes; intensive care unit; interdisciplinary communication; patient safety; quality improvement; teamwork
Copyright © 2018 by the Society of Critical Care Medicine and Wolters Kluwer Health, Inc. All Rights Reserved.