
Empirical Investigations

Validation of a Tool to Measure and Promote Clinical Teamwork

Guise, Jeanne-Marie MD, MPH; Deering, Shad H. MD; Kanki, Barbara G. PhD; Osterweil, Patricia BS; Li, Hong MD; Mori, Motomi PhD; Lowe, Nancy K. PhD, CNM

Author Information
Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare 3(4):p 217-223, Winter 2008. | DOI: 10.1097/SIH.0b013e31816fdd0a


International reports from bodies such as the National Health Service,1,2 the Institute of Medicine,3 and The Joint Commission4 find that human factors such as communication and teamwork often play a major role in adverse events. Reviews of malpractice claims, sentinel events, and the literature consistently find that communication and teamwork are among the top contributors to adverse events and malpractice claims. An analysis of closed malpractice claims in one system found that 31% of adverse events were attributable to communication problems.5 A national review conducted by The Joint Commission found that over two-thirds of sentinel obstetric events in which the infant died or suffered severe brain damage were attributed to human factors and communication.4 Similarly, a recent review demonstrated that up to 40% of all pregnancy-related maternal deaths were potentially preventable.6 For these reasons, the implementation of team training analogous to the Crew Resource Management (CRM)7 curriculum in aviation was recommended as an important means to improve patient safety by the Institute of Medicine in its 2000 report “To Err Is Human”3 and by The Joint Commission’s Patient Safety Plan (The Joint Commission: Sample Outline for a Patient Safety Plan; accessed March 23, 2004).

This study is one of several studies conducted by the State Obstetric and Pediatric Research Collaborative OB Safety Initiative. This particular study is a pilot validation study that was conducted as part of a larger multicenter study funded by the Agency for Healthcare Research and Quality (AHRQ U18-H8015800) designed to test whether simulation of obstetric emergencies and teamwork training improve the process of care and patient safety across hospital settings. Clinical drills, didactics, and simulations are being used by many organizations to improve communication and teamwork and ultimately improve patient safety. A reliable tool to measure teamwork is needed to evaluate the effectiveness of teamwork interventions and to learn more about what predicts good teamwork in simulated and clinical settings. We sought to develop and validate an easy-to-use tool to measure teamwork factors and their contribution to overall team performance in both simulation and clinical settings. Our goal was to develop a tool that could be used in the field to assist in debriefing team simulations and also by clinical teams to evaluate teamwork skills in routine and emergent clinical care.


Clinical Teamwork Scale Development

The Clinical Teamwork Scale™ (CTS) was developed based on team training components important in crew resource management (CRM) training programs.7 As shown in Figure 1, the CTS contains 15 items in 5 conceptual teamwork domains: communication, situational awareness/resource management, decision-making, role responsibility (leader/helper), and patient-friendliness. Although items were written to be easily understood at face value, descriptive item definitions were developed to orient raters to the intent of each item (Table 1). For example, a complex topic such as communication was further delineated into orienting new members, transparent thinking, directed communication, and closed-loop communication to improve efficiency and clarity in ratings. For all items except 1 (target fixation), a 0 to 10 rating scale was adopted. The target fixation item is evaluated as a “yes” or “no” response because we determined that this behavior is either present or not. The scale values are further anchored by the qualitative descriptors “unacceptable” (0), “poor” (1–3), “average” (4–6), “good” (7–9), and “perfect” (10). In addition, each item has a “not relevant” option if the evaluator believes that the item was irrelevant to teamwork in a specific scenario. We chose a 0 to 10 scale to increase the rating range for 2 reasons: first, clinical teams in practice are likely to perform above average even though additional improvement is possible, and second, raters tend to evaluate teams favorably.8,9 Both of these phenomena could make it impossible to discriminate smaller differences in performance among qualified teams with a 3- or 5-point scale. We thought that having 3 possible scores within each qualitative level of team performance, much like A+, A, A−, would enable us to measure and observe incremental scoring differences and would be comfortable for nurses, providers, and staff in the field when rating themselves and colleagues.

Figure 1.:
CTS—Clinical Teamwork Scale™ (Global).
Table 1:
Clinical Teamwork Scale™ (Global) Descriptive Anchors
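The anchoring scheme described above maps each numeric score band to a qualitative descriptor. A minimal Python sketch of that mapping (the function name is ours, for illustration only, not part of the CTS):

```python
def cts_descriptor(score):
    """Map a CTS item score (0-10) to its qualitative anchor,
    per the scale's published descriptors."""
    if score == 0:
        return "unacceptable"
    if 1 <= score <= 3:
        return "poor"
    if 4 <= score <= 6:
        return "average"
    if 7 <= score <= 9:
        return "good"
    if score == 10:
        return "perfect"
    raise ValueError("CTS item scores run from 0 to 10")
```

Each qualitative level except the extremes spans 3 scores, which is the A+, A, A− style granularity the authors describe.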

CTS Validation

Standardized videos of teams responding to the same obstetrical clinical scenario (shoulder dystocia) were created simulating unacceptable/poor (scale 0–3), average (scale 4–6), and good/perfect teamwork (scale 7–10). Four actors played the same specified roles in all videos with differing teamwork “scripts.” The clinical scenario was identical with the total scenario length ranging from 5 to 6 minutes. Three evaluators (1 perinatologist, 1 generalist obstetrician/gynecologist, and a doctorally prepared certified nurse-midwife) independently viewed the unidentified videos in no specific order and scored teamwork using the CTS. There was no communication among reviewers and they were physically separated during video review. We did not emphasize extensive training in use of the CTS because our intent was to develop a tool that could be used by clinicians for their own evaluation of teamwork during everyday clinical practice. The evaluators were trained in the principles of CRM through reading the literature, mentoring by CRM experts from the National Aeronautics and Space Administration, attending structured CRM training programs, and involvement in clinical research on team training for obstetrical emergencies. In addition, the evaluators, National Aeronautics and Space Administration experts, and statisticians discussed how the CTS would be used (for example, nonapplicable vs. not observed).

Statistical Analyses

Data analyses were performed using SAS version 9.2 (SAS Institute, Cary, NC). Construct validity was assessed by comparing the distribution and median of scored ratings (0–10) with the a priori designed teamwork level of each scenario. Usability was assessed by the completeness (number of items rated) and accuracy (number of ratings falling within ±1 point of the scale range for the designed teamwork level) of ratings. Reliability of the CTS was tested by interrater agreement (Kappa statistic),10–12 concordance (Kendall coefficient)13 and correlation (Pearson coefficient) of scored ratings (scale 1–10), and interrater reliability (intraclass correlation coefficient).14
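The chance-corrected agreement statistic named above can be illustrated with a pure-Python sketch of Fleiss' kappa for multiple raters (the published analysis was run in SAS; the data shown in the test below are hypothetical):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for m raters classifying items into categories.

    counts: one row per rated item; each row holds the number of
    raters who assigned that item to each category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # marginal proportion of all assignments falling in each category
    p = [sum(row[j] for row in counts) / (n_items * n_raters)
         for j in range(n_cats)]
    # per-item observed agreement among rater pairs
    per_item = [(sum(c * c for c in row) - n_raters) /
                (n_raters * (n_raters - 1)) for row in counts]
    p_obs = sum(per_item) / n_items
    p_exp = sum(q * q for q in p)
    return (p_obs - p_exp) / (1 - p_exp)
```

Perfect agreement yields kappa = 1, and agreement no better than chance yields values at or below 0, matching the Landis and Koch benchmarks cited later in the article.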

Reliability of the CTS ratings was also examined by estimating the variance of each component based on generalizability theory.15,16 Because each scenario was designed to be distinctly different, the analysis was stratified by scenario. The 1-facet generalizability model included components for rater and for the rater-by-item interaction, with the restricted maximum likelihood (REML) method used to estimate the variance of the components. The percentage of variance attributed to each component was calculated by dividing the component variance by the total variance in the model (components plus error).
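The percentage-of-variance step described above is a simple proportion of each estimated component to the model total. A sketch, using purely hypothetical component estimates (REML estimation itself would be done in a statistics package, as the authors did in SAS):

```python
def variance_percentages(components):
    """Express each estimated variance component as a percentage of
    the total variance in the model (components plus error)."""
    total = sum(components.values())
    return {name: 100.0 * v / total for name, v in components.items()}

# hypothetical 1-facet decomposition: rater, rater-by-item interaction, error
example = {"rater": 0.05, "rater_x_item": 0.55, "error": 0.30}
```

With these illustrative numbers, the rater-by-item interaction dominates the total, which is the pattern the authors report for the CTS.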


Construct Validity and Usability

A good evaluation tool must demonstrate construct validity and usability (ease of use). To determine construct validity, we examined the distribution and median score for each scenario according to its predetermined teamwork level. Table 2 shows the scores from the 3 raters for the 3 scenarios. For scenario 1 (poor teamwork), the majority of scores (80%–82%) and the median score from all 3 raters were in the range 1 to 3, indicating poor teamwork. For scenario 2 (good to perfect teamwork), the majority of scores (60%–80%) and the median scores were 9 to 10, consistent with very good to perfect performance. For scenario 3 (average teamwork), the majority of scores (73%–80%) and the median scores fell in the average range of 4 to 6. Scores from all 3 raters therefore corresponded with the a priori designed teamwork level of each scenario. Additionally, raters were able to rate almost every item (4 missing), indicating that the tool was easy to use. Completeness and accuracy of the ratings across all 3 scenarios by the 3 raters are presented in Table 3. Completeness ranged from 88.9% to 100% among rating items. A rating was deemed “accurate” if the item score fell within ±1 point of the scale range for the intended scenario teamwork level. Twelve of the 15 items had 100% accurate ratings, and 3 items (target fixation, role clarity, and overall role responsibility) had accuracies of 66.7% to 88.9%.

Table 2:
Evaluator Ratings by Scenario
Table 3:
Completeness and Accuracy of Item Ratings Across All Scenarios
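The two usability metrics reported in Table 3 can be sketched as follows. Here `None` marks an item a rater left blank, and `low`/`high` are the score band for the scenario's designed teamwork level (the data in the test are hypothetical, not the study's ratings):

```python
def completeness(scores):
    """Percent of items the rater actually scored (None = not rated)."""
    rated = [s for s in scores if s is not None]
    return 100.0 * len(rated) / len(scores)

def accuracy(scores, low, high):
    """Percent of rated items falling within +/-1 point of the designed
    scale range [low, high] for the scenario's teamwork level."""
    rated = [s for s in scores if s is not None]
    hits = [s for s in rated if (low - 1) <= s <= (high + 1)]
    return 100.0 * len(hits) / len(rated)
```

For a poor-teamwork scenario (designed range 1–3), a score of 4 still counts as accurate under the ±1 rule, while a score of 6 does not.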


Figure 2 displays the correlation scatter plots of scores between pairs of raters, with the x and y axes representing the respective raters' scores. When there is perfect correlation between raters, all points fall directly on the diagonal line; the more distant a point is from the line, the more discordant the scores between the 2 raters. The overall item score correlations between raters were excellent, with Pearson correlation coefficients between 0.94 and 0.96. The points were also widely dispersed along the plots, indicating that the raters used the full range of scores across the 3 scenarios. The intraclass correlation coefficient of interrater reliability was also excellent at 0.98 (95% confidence interval = 0.97–0.99).

Figure 2.:
Score correlations between pairs of raters across scenarios.

The degree of score concordance and agreement among the 3 raters is shown in Table 4. The Kappa statistic was computed using 4 categories: unacceptable/poor (0–3), average (4–6), good/perfect (7–10), and not relevant (coded 11). In general, Kappa values between 0.6 and 0.8 indicate substantial agreement, and values exceeding 0.8 are considered excellent.12 The overall agreement among raters was substantial (Kappa = 0.78), and agreement within teamwork levels (poor, average, and good/perfect) was quite high. Because categorizing the scores into 4 groups may discard pertinent information, we also applied Kendall's method for ordinal data to the items that were scored 1 to 10 by all 3 raters. Note that this analysis excluded any item that was scored “unacceptable” or “not relevant,” or was not scored (missing), by any 1 rater. Concordance among the 3 raters was even higher, with a Kendall coefficient of 0.95.

Table 4:
Degree of Agreement Among 3 Raters
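Kendall's coefficient of concordance treats each rater's scores as a ranking of the items and measures how closely the raters' rankings agree. A self-contained sketch (no tie correction in the denominator, so heavy ties would slightly understate W; data in the test are hypothetical):

```python
def mean_ranks(scores):
    """Rank scores 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for m raters x n items.
    W = 1 means identical rankings; W = 0 means no concordance."""
    m, n = len(ratings), len(ratings[0])
    rank_rows = [mean_ranks(row) for row in ratings]
    totals = [sum(rr[i] for rr in rank_rows) for i in range(n)]
    grand_mean = sum(totals) / n
    s = sum((t - grand_mean) ** 2 for t in totals)
    return 12.0 * s / (m * m * (n ** 3 - n))
```

Because W is computed on ranks rather than on the 4 collapsed categories, it retains the ordinal information that the Kappa categorization discards, which is why the authors report it alongside Kappa.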

The variance in scores among raters was also estimated (Table 5). The total variance in each of the 3 scenarios was relatively small (<1.0) relative to the potential rating range of 0 to 10. However, the estimated total variance was higher in scenario 3 (0.905) than in scenario 1 (0.729) or scenario 2 (0.214). The largest percentage of variance across scenarios was due to the rater-by-item interaction, indicating that the variance within each scenario depended primarily on the idiosyncratic interpretation of items by individual raters. Variance due to the rater was nearly zero in the poor and good/perfect scenarios and small in the average scenario (0.299).

Table 5:
Variance Estimates Within Each Scenario

To examine potential sources of systematic error, rater score variation across items was examined by plotting raters by items for each scenario. As illustrated in Figure 3, more discrepancy and less consistency among raters were observed across items in scenario 3 (average teamwork), particularly for the items addressing leadership and role responsibility (items 11–13). Though less pronounced, a similar trend was observed across ratings in the poor scenario.

Figure 3.:
Item score correlations among raters by scenario.


Given international recommendations for interdisciplinary team drills, and the increasing focus on patient safety and reducing preventable safety events in healthcare, it is essential to have tools that can evaluate teamwork to measure the effectiveness of teamwork interventions. Several evaluation tools have been proposed,9,17–23 but few have been validated.9,17,21

At the time we began our work, the available teamwork measurement tools did not meet our requirement for a tool that could be used both during simulation and for the evaluation of teamwork during everyday clinical care. Though the CTS was developed for use during our obstetric emergency simulations, we intentionally designed it as a universal measure of teamwork so that it could be incorporated into various curricula where teamwork evaluation may be helpful, as well as into teamwork quality programs where unit or hospital staff wish to evaluate their teamwork during emergencies or everyday clinical care on an ongoing or periodic basis. We were also interested in developing a tool that was applicable across a wide variety of clinical environments and was independent of specialty and type of institution, from small rural settings to complex teaching environments.

The CTS presented in this article is not intended to measure individual performance of teamwork skills but to provide an overall evaluation of the use of teamwork skills by a group of healthcare providers working together during a simulation exercise or in the provision of clinical care. Although several other teamwork scales have appeared in the literature, most differ from the CTS in that they were designed to evaluate the teamwork skills of individuals17,18,21 or were designed for specific clinical environments or situations, such as the operating room,24 the emergency department,25 neonatal resuscitation,26 or simulated obstetrical emergencies in an operating room environment.27 Two newer scales, the Mayo High Performance Teamwork Scale (MHPTS)9 and the Communication and Teamwork Skills Assessment (CATS),28 are more similar to the CTS but have clear differences. The CATS was designed for use by trained observers of simulated or real clinical events to arrive at a quality score for each of 4 overall behavioral categories (coordination, situational awareness, cooperation, and communication) from 18 different behaviors rated on a 3-point scale. The CATS has 3 additional behaviors in the categories of coordination, cooperation, and communication that are rated if a crisis situation arises. In contrast, the MHPTS is a 16-item tool (8 items are always rated and 8 are rated if applicable) designed to briefly assess CRM skills in training settings by training participants, both to assess the effectiveness of training and to encourage engagement in self-reflective training experiences. These purposes are very similar to our intent for the CTS. Although initial psychometric data have been published on each of these scales and both share similarities with the CTS, neither has been validated against known standards of poor, good, or excellent teamwork as the CTS has.

Since one of our goals was to develop a tool that could be used in the clinical setting by any healthcare personnel, we thought it essential that the concepts measured be easily understood with minimal training. To that end, we broke important concepts down into explicitly observable behaviors and provided descriptive anchors. For example, communication was further itemized into orienting new team members [eg, with SBAR (Situation, Background, Assessment, Response)], transparent thinking (thinking out loud), directed communication (directing a message to a specific individual), and closed-loop communication (acknowledging receipt of the message and its status). The CTS had to cover the main domains of teamwork yet be brief, to be useful to busy clinical personnel and to evaluators during fast-moving simulations. Additionally, we tried to avoid concepts that required observation of multiple skills (in essence, compound skills), as this would be challenging for untrained clinical personnel and could introduce substantial measurement variance and ambiguity into the ratings. Finally, the CTS focuses on the functioning of the total team rather than the performance of individual members.

Studies have found that respondents tend to rate themselves and their colleagues favorably.8,9 This tendency makes it quite challenging to measure differences in teamwork unless they are extreme. In general, clinical teams are likely to perform well (average or better), yet even good teams would like to improve. We therefore needed a scale with greater response discrimination to measure less-than-extreme differences. Thus, we chose a 10-point scale with the 2 extremes, unacceptable and perfect, and 3 scoring options within each category, similar to the A+, A, A− grading system. Other recently published tools of overall team performance, such as the CATS and the MHPTS, use more limited 3-point rating scales for each item, ranging from “expected but not observed” to “observed and good” and from “never or rarely” to “consistently.” Only additional research with these tools and the CTS will determine the ability of these various scaling methods to validly, reliably, and sensitively measure the complex behaviors that differentiate team performance under both simulated and real clinical conditions across settings and specialties.

Our study, like any, is not without limitations. Our first goal was to develop a high-quality teamwork measurement tool. We therefore concentrated initially on methodological and technical performance, using standardized videos and consistent evaluator training to reduce heterogeneity, and intentionally limited the number of evaluators to enhance the homogeneity of teamwork training. Now that we have demonstrated quality as measured by standard parameters, we are expanding our focus to testing the tool across a diversity of clinical and simulation personnel (the introduction of heterogeneity). To that end, we have implemented the tool in the field for rapid-sequence evaluation of teamwork during 10-minute simulated scenarios. To date, we have found it easy to use and helpful in guiding team debriefings, and we are continuing to evaluate the tool and plan to analyze these data further in these “real-world” simulations.


1. Confidential Enquiry into Maternal Deaths in the United Kingdom. Why Mothers Die. London: Royal College of Obstetricians and Gynaecologists; 1999.
2. Chief Medical Officer. An Organisation with a Memory. Report of an Expert Group on Learning from Adverse Events in the NHS. London: HMSO; 2000.
3. Kohn LT, Corrigan JM, Donaldson MS. To Err Is Human: Building a Safer Health System. Washington, DC: Institute of Medicine: National Academy Press; 2000.
4. The Joint Commission. Preventing infant death and injury during delivery. Sentinel Event Alert 2004;30:1–3.
5. White AA, Pichert JW, Bledsoe SH, et al. Cause and effect analysis of closed claims in obstetrics and gynecology. Obstet Gynecol 2005;105(5 pt 1):1031–1038.
6. Berg CJ, Harper MA, Atkinson SM, et al. Preventability of pregnancy-related deaths: results of a state-wide review. Obstet Gynecol 2005;106:1228–1234.
7. Crew Resource Management Training. Advisory Circular. US Department of Transportation, FAA; 2004. AC No. 120-51E.
8. Jankouskas T, Bush MC, Murray B, et al. Crisis resource management: evaluating outcomes of a multidisciplinary team. Simul Healthcare 2007;2:96–101.
9. Malec JF, Torsher LC, Dunn WF, et al. The Mayo high performance teamwork scale: reliability and validity for evaluating key crew resource management skills. Simul Healthcare 2007;2:4–10.
10. Fleiss JL. Statistical Methods for Rates and Proportions. 3rd ed. New York: John Wiley & Sons, Inc; 2003.
11. Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull 1979;86:974–977.
12. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
13. Kendall MG. Rank Correlation Methods. 2nd ed. London: Charles Griffin & Co. Ltd; 1955.
14. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86:420–428.
15. Brennan R. Generalizability Theory. New York: Springer-Verlag, Inc; 2001:4.
16. VanLeeuwen DM, Barnes MD, Pase M. Generalizability theory: a unified approach to assessing the dependability (reliability) of measurements in the health sciences. J Outcome Meas 1998;2:302–325.
17. Fletcher G, Flin R, McGeorge P, et al. Anaesthetists’ Non-Technical Skills (ANTS): evaluation of a behavioral marker system. Br J Anaesth 2003;90:580–588.
18. Gaba DM, Howard SK, Flanagan BF, et al. Assessment of clinical performance during simulated crises using both technical and behavioral ratings. Anesthesiology 1998;89:8–18.
19. Kim J, Neilipovitz D, Cardinal P, et al. A pilot study using high-fidelity simulation to formally evaluate performance in the resuscitation of critically ill patients: The University of Ottawa Critical Care Medicine, High-Fidelity Simulation, and Crisis Resource Management I Study. Crit Care Med 2006;34:2167–2174.
20. Thomas EJ, Sexton JB, Lasky RE, et al. Teamwork and quality during neonatal care in the delivery room. J Perinatol 2006;26:163–169.
21. Weller JM, Bloch M, Young S, et al. Evaluation of high fidelity patient simulator in assessment of performance of anaesthetists. Br J Anaesth 2003;90:43–47.
22. Wright MC, Luo X, Richardson WJ, et al. Piloting team training at Duke University Health System. Paper presented at: Proceedings of the Human Factors and Ergonomics Society 51st Annual Meeting; 2006; Santa Monica, CA.
23. Larison KA, Butler JT, Schriefer JA, et al. Improving team communication at delivery among obstetric, anesthesia and neonatal team members using didactic instruction and on-site simulation-based training. Simul Healthcare 2006;1:97.
24. Healey AN, Undre S, Vincent CA. Developing observational measures of performance in surgical teams. Qual Saf Health Care 2004;13(suppl 1):i33–i40.
25. Morey JC, Simon R, Jay GD, et al. Error reduction and performance improvement in the emergency department through formal teamwork training: evaluation results of the MedTeams project. Health Serv Res 2002;37:1553–1581.
26. Thomas EJ, Sexton JB, Helmreich RL. Translating teamwork behaviours from aviation to healthcare: development of behavioural markers for neonatal resuscitation. Qual Saf Health Care 2004;13(suppl 1):i57–i64.
27. Morgan PJ, Pittini R, Regehr G, et al. Evaluating teamwork in a simulated obstetric environment. Anesthesiology 2007;106:907–915.
28. Frankel A, Gardner R, Maynard L, et al. Using the Communication and Teamwork Skills (CATS) Assessment to measure health care team performance. Jt Comm J Qual Patient Saf 2007;33:549–558.

Validation study; Safety; Teamwork; Obstetrics; Health care; Patient care team; Simulation; Quality assurance; Quality of care; Interprofessional relations; Clinical competence

© 2008 Lippincott Williams & Wilkins, Inc.