Research Reports

Measuring Medical Housestaff Teamwork Performance Using Multiple Direct Observation Instruments: Comparing Apples and Apples

Weingart, Saul N. MD, MPP, PhD; Yaghi, Omar MB, BCh, BAO; Wetherell, Matthew; Sweeney, Megan

doi: 10.1097/ACM.0000000000002238


Team performance is a critical component of effective clinical care. Teams operate throughout health care organizations, yet the degree of interprofessional integration and cohesion varies depending on the nature of the team and the practice setting. Team failures—particularly those involving communication among members of the care team—are a major contributor to malpractice claims in the United States.1

Although team performance is a valued attribute of clinical care, there are few well-established tools for measuring teamwork behaviors. Surveys that elicit participants’ self-assessments of teamwork provide direct access to individuals’ beliefs, attitudes, and experiences.2,3 In contrast, independent assessments of team performance rely on observed behaviors. Although filtered through the observer’s own lens, this approach may address biases inherent in self-assessments and allow for standardized scoring across individuals and teams.4 However, reviews of behavioral rating scales based on direct observation have reported inconsistent terminology and inaccurate comparisons to date, and few empirically tested teamwork observation tools are available for use in health care settings.5–8 An instrument that allows for direct observation of teamwork in diverse health care settings would complement and extend team members’ self-assessments, in turn allowing for a more comprehensive assessment of team performance.

To inform the development of a direct observation tool that is suitable for use on a general medicine unit, we examined the domain composition and concordance of nine previously published instruments. We searched the literature to identify and assess candidate tools, and then pilot tested them among housestaff teams on the medicine wards of a teaching hospital. This clinical care and teaching environment has been largely neglected in studies of teamwork in favor of teams in closed units such as the operating room or emergency department, or in simulation or classroom settings.9–11 We pilot tested the use of selected team observation tools with the goal of assessing the concordance of overall and domain-specific teamwork performance across existing, published instruments. We hypothesized that similar instruments would rate the same team on the same day similarly in terms of overall and domain-specific teamwork performance. Domain concepts included leadership, team structure, situation monitoring, mutual support, and communication.12


Study site

We conducted this study at Tufts Medical Center, a 415-bed Boston teaching hospital serving a diverse patient population. The adult inpatient medical service is organized into disease-specific teams that include general medicine, geriatrics, pulmonology, nephrology, gastroenterology, and infectious disease services. Each team comprises a specialty-trained attending physician; a resident; one or two interns; and medicine, pharmacy, and physician assistant students. While disease-specific teams are assigned to one of four primary nursing units, patients assigned to a team are often located on multiple units because of bed constraints at the time of admission or patients’ particular needs (e.g., negative-pressure room). Staff pharmacists and nurses interact regularly with the physician teams but do not participate routinely in daily work rounds.


We searched Google, PubMed, Medline, and PsycInfo for English-language studies published before January 1, 2017, using the terms team, multidisciplinary, teamwork, collaboration, team behavior, team culture, team effectiveness, and clinical collaboration in combination with measurement, tool, evaluation, instrument, scale, assessment, and survey, together with medicine. Additional terms specific to teamwork in medicine that we searched were crisis resource management, interdisciplinary training, health professional teams, patient care team, health care team, interprofessional cooperation, and crew management. We identified 42 relevant articles and abstracts and 6 additional references based on hand-searched bibliographies. We excluded 20 articles that examined team-related theoretical concepts without an observation tool and 12 articles that described teamwork behaviors in clinical settings that were not readily generalizable to a medicine ward service (e.g., anesthesia care in the operating room). Seven additional articles were excluded because they described participant surveys about teamwork using self-report assessments rather than instruments intended for independent, third-party observers.

We selected 9 of the most promising instruments for pilot testing. We prioritized instruments on the basis of their authors’ use of a theory-based approach to instrument development, previous field testing in a simulation or clinical setting, potential applicability to work rounds on a medical or surgical hospital ward, appropriateness for interdisciplinary teams including medical students and housestaff, and evidence of interrater reliability or assessments that support their construct validity.13 The tools, listed in Table 1, were the KidSIM Team Performance Scale (KidSim),14 Teamwork Assessment Scale (TAS),15 Interprofessional Teamwork Evaluation (IPTE),16 TeamSTEPPS Teamwork Perceptions Questionnaire (TPQ),17 Performance Assessment Tools for Interprofessional Communication and Teamwork (PACT),18 Clinical Teamwork Scale (CTS),19 Mayo High Performance Teamwork Scale (MAYO),20 Team Performance Observation Tool 2.0 (TPOT 2.0),21 and Communication and Teamwork Skills Assessment (CATS).22 These tools were developed and piloted in an obstetrics unit, an emergency department, military hospitals, and various educational simulation settings.

Table 1:
Nine Teamwork Observation Instruments Used in an Observational Study of Scoring Housestaff Teamwork Performance, Tufts Medical Center, 2015

Members of the study team (M.W., M.S.) prepared as observers by studying the administration of each tool using published guidance in original articles and materials available on the Internet, or by consulting with the instrument’s original author. Once comfortable with the scoring methodology of each tool, each observer joined a medical team for morning work rounds from 7 am to approximately 9 am on a weekday. We selected teams at random. The observer contacted the resident in advance, described the project, and provided a one-page written project summary. In no case did the team decline to be observed, although the timing of the observation day was occasionally adjusted to address logistical conflicts with teaching or case management conferences.

Observers (M.W., M.S.) pilot tested the instruments in two-hour observation sessions with seven medical teams during morning work rounds in July and September 2015. The goal of pilot testing was to assess the feasibility of a single observer scoring a single team using multiple instruments concurrently and to ensure consistent scoring.

During the observation period, from October through December 2015, one researcher (M.S.) attended work rounds two to three mornings per week on randomly selected days. The observer visited a total of 20 medicine ward teams from six disease-specific clinical services. Teams were never identical because of rotating attending and housestaff schedules, although the format was consistent across teams. Work rounds consisted of a “tabletop” review of specialist consultations, overnight events, vital signs, test results, and a preliminary case discussion, followed by bedside rounding and finalization of the day’s care plans. Rounds incorporated input from housestaff members of the team, nurses, pharmacists, and other clinicians involved in patients’ care.

The observer scored each team’s performance on all nine instruments on each day, moving iteratively between items and tools through each observation session. Preliminary scores were revised during the course of the session on the basis of subsequent observations. The observer sought to ensure consistent scoring of similar items across instruments. She recorded when there was no opportunity to observe the behavior and when the behavior was not performed. All items on each instrument were completed during the observation session.


Observations recorded on paper forms were entered into an electronic spreadsheet. Because the various observation tools used different rating scales, we normalized item scores from 1 (low) to 5 (high) to allow for comparison. Using normalized item scores, we calculated the mean overall score and domain scores of each team for each observation tool. Because the tools used different numbers of items in each domain, a team’s overall score could disproportionately reflect its performance in the domain with the most items. To address this potential bias, we calculated a weighted overall score for each tool that weighted the individual items so that each domain contributed equally to the overall team score.
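The normalization and domain-weighting logic described above can be sketched as follows. This is an illustrative reconstruction, not the authors’ code; the function names and the tuple representation of items are our own assumptions.

```python
# Hypothetical sketch: rescale heterogeneous item scores to a common
# 1-5 scale, then compute an unweighted overall mean and a weighted
# overall mean in which each domain contributes equally.
from collections import defaultdict

def normalize(score, lo, hi):
    """Linearly rescale a raw item score from [lo, hi] to [1, 5]."""
    return 1 + 4 * (score - lo) / (hi - lo)

def team_scores(items):
    """items: list of (domain, raw_score, scale_lo, scale_hi) tuples
    for one team scored on one instrument on one day."""
    norm = [(d, normalize(s, lo, hi)) for d, s, lo, hi in items]
    overall = sum(v for _, v in norm) / len(norm)  # unweighted mean

    by_domain = defaultdict(list)
    for d, v in norm:
        by_domain[d].append(v)
    domain_means = {d: sum(vs) / len(vs) for d, vs in by_domain.items()}
    # Weighted overall: average of domain means, so a domain with many
    # items cannot dominate the overall score.
    weighted = sum(domain_means.values()) / len(domain_means)
    return overall, weighted, domain_means
```

With this weighting, an instrument carrying many communication items and a single leadership item would count communication and leadership equally in the weighted overall score.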

As the primary goal of the project was to compare team observation instruments with one another, we examined domain and overall (weighted and unweighted) scores by team across the nine observation tools using the Kruskal–Wallis statistic. We used the Student t test, comparing each team’s overall mean score with the group mean, to identify potential outliers. We also calculated pairwise correlation coefficients among instruments. We displayed teamwork scores using line graphs and box-and-whisker plots, and domain-specific performance using radial graphics. Analyses used Excel 2010 (Microsoft Corp., Redmond, Washington) and Stata statistical software, version 9.0 (StataCorp, College Station, Texas). This study was reviewed in advance by the hospital’s institutional review board (IRB) and determined to be an educational project exempt from IRB review.
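The statistical comparisons described above can be illustrated with standard library calls. This sketch is not the authors’ code (they used Excel and Stata); the synthetic 20-teams-by-9-instruments score matrix and variable names are assumptions for illustration.

```python
# Illustrative sketch of the analysis: Kruskal-Wallis across teams,
# per-team one-sample t tests against the grand mean, and pairwise
# correlations among instruments. Data here are synthetic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# scores[i, j]: weighted overall score of team i on instrument j
scores = rng.uniform(2.5, 4.5, size=(20, 9))

# Kruskal-Wallis: treat each team's nine instrument scores as one group
# and test whether score distributions differ across teams.
h_stat, p_kw = stats.kruskal(*[scores[i] for i in range(scores.shape[0])])

# One-sample t test per team: compare the team's nine instrument scores
# with the grand mean to flag potential outlier teams.
grand_mean = scores.mean()
outlier_p = [stats.ttest_1samp(scores[i], grand_mean).pvalue
             for i in range(scores.shape[0])]

# Pairwise correlation coefficients among instruments (9 x 9 matrix).
corr = np.corrcoef(scores.T)
```

Box-and-whisker plots and radar graphs of the resulting team and domain scores would then be drawn from the same matrix.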


Instrument structure

As shown in Figure 1, the 9 tools together encompassed 5 major domains, with 5 to 35 items per instrument, for a total of 161 items per observation session. Items on different tools addressed similar domains—team structure, leadership, situation monitoring, mutual support, and communication—although the scoring scale varied across tools. Many tools employed a 5-point indicator scale to differentiate levels of performance anchored by behavioral descriptors. Several tools used a 0–10 scale, anchored by the qualitative descriptors from “unacceptable” (0) to “perfect” (10), or “not relevant” to a specific scenario. The Mayo tool asked the observer to rate behaviors on the basis of frequency, from “never” (1) to “consistently” (4). The TPOT 2.0 tool assessed agreement with performance from “strongly agree” to “strongly disagree.”

Figure 1:
Teamwork observation instrument structure, by domain. From an observational study of nine instruments’ scoring of housestaff teamwork performance, Tufts Medical Center, 2015.Abbreviations: KidSim indicates KidSIM Team Performance Scale; TAS, Teamwork Assessment Scale; IPTE, Interprofessional Teamwork Evaluation; TPQ, TeamSTEPPS Teamwork Perceptions Questionnaire; PACT, Performance Assessment Tools for Interprofessional Communication and Teamwork; CTS, Clinical Teamwork Scale; MAYO, Mayo High Performance Teamwork Scale; TPOT 2.0, Team Performance Observation Tool 2.0; CATS, Communication and Teamwork Skills Assessment.

Team performance across instruments

We found considerable differences across instruments based on an individual team’s performance on a given day, as illustrated graphically in Figure 2. Each line in Figure 2 panel A represents the unweighted mean performance of a single team on a single day. Because different instruments have different numbers of items for each domain, differences in performance may be related to the way that individual tools emphasize or deemphasize various teamwork behaviors. To adjust for these differences, Figure 2 panel B presents the adjusted overall mean score for each tool, weighting each domain—rather than each question—equally. Like the unweighted analysis, the weighted analysis showed considerable variability in teamwork performance depending on the observation tool.

Figure 2:
Team performance and instrument scores, from an observational study of nine instruments’ scoring of housestaff teamwork performance, Tufts Medical Center, 2015. Abbreviations: See Figure 1 legend.A. Each line represents the unweighted mean performance of a single team on a single day.B. Each line presents the adjusted overall mean score for each instrument, weighting each domain—rather than each question—equally.

Figure 3 offers an alternative representation of measurement variation across tools for a given team. Using a box-and-whisker plot, Figure 3 panel A displays each team’s unweighted mean performance score, averaged across all the instruments, as a diamond. The bar displays the interquartile range (25th–75th percentile) among the nine instruments used to assess the team on a single day. The whiskers represent the high and low scores. The figure demonstrates statistically significant variation in within-team performance (P = .004, Kruskal–Wallis test). Comparing each team’s mean score against the group mean, team 11’s performance was statistically significantly better (P = .02) and team 18’s was worse (P < .001) than the group as a whole (t test).

Figure 3:
Weighted and unweighted team performance scores, from an observational study of nine instruments’ scoring of housestaff teamwork performance, Tufts Medical Center, 2015.A. Each team’s unweighted mean performance score averaged across all the instruments as a diamond. The bar displays the interquartile range (25th–75th percentile) among the nine instruments used to assess the team on a single day. The whiskers represent the high and low scores.B. Each team’s weighted performance.

Figure 3 panel B displays a weighted analysis, again demonstrating significant variation in same-team performance for the group as a whole (P = .002, Kruskal–Wallis). The degree of variation is particularly noteworthy given the positive correlations among weighted teamwork scores by instrument, as shown in Supplemental Digital Appendix 1. While the various instruments produced discordant rankings of performance, they measured similar phenomena.

Comparing each team’s weighted mean score against the group mean, teams 11 and 18 remained the only statistically significant outliers, with teamwork performance above and below the group mean (P = .03 and P < .001, respectively, by t test). While team 18 was identified as the low performer by all nine instruments, team 11 scored highest on only five of the tools. Notably, in weighted analyses, 10 of the 20 teams ranked as high performers by at least three of the tools were ranked as low performers by at least three other tools.

Domain-specific performance

Creating an overall weighted teamwork score assumes that each domain—leadership, team structure, team monitoring, mutual support, and communication—is of equal importance in determining overall team performance. It also assumes that each instrument assigned a similar score to a particular team for a given domain. To assess whether this was true, we analyzed the average domain-specific performance scores for each team using the Kruskal–Wallis statistic and displayed domain scores with radar graphs (Figure 4). To simplify the graphs, only the first 10 of the 20 teams’ scores are shown.

Figure 4:
Overall weighted domain score for each of 10 teams, from an observational study of nine instruments’ scoring of housestaff teamwork performance, Tufts Medical Center, 2015. Each axis measures team performance by leadership, team structure, team monitoring, mutual support, and communication domains.Abbreviations: See Figure 1 legend.

The figures show that a given team’s performance on a particular teamwork domain varied according to the instrument. We found statistically significant differences among mean domain scores for leadership (P < .001) and team structure (P < .001), with a trend nearing significance for situation monitoring (P = .10).


In this observational study of teamwork performance, we found variation in the rating of individual teams assessed concurrently by a single observer using multiple instruments. The differences persisted in analyses that weighted each teamwork performance domain equally, in an effort to adjust for instruments that included more items in one domain than another. Domain-level analyses also showed variation in performance of a given team on the same teamwork domain incorporated in different instruments. Except for agreement about a single low performer, there was little alignment or consistency across tools in distinguishing high- and low-performing teams. Overall, we found little evidence that existing, published teamwork observation tools offer concordant assessments even of the same team on the same day.

Teamwork is an essential attribute of high-performing medical teams, yet objective assessment of teamwork attributes remains problematic. Driven in part by the successful deployment of the TeamSTEPPS program, a consensus appears to have emerged about the component domains that represent robust teamwork. Our analysis showed at least moderate correlation among ratings performed using disparate tools. Nevertheless, evidence connecting teamwork to clinical outcomes is derived largely from self-reported safety attitudes surveys,3,23 and to a lesser extent from early studies of obstetric complications10,24,25 and from the actuarial improvement in malpractice claims among obstetric and anesthesia clinicians after simulation training.26 Tools to facilitate direct observation of teamwork performance are needed to standardize the measurement of teamwork behaviors; to allow for comparison across teams, services, and settings; to understand the contribution of attending supervision; to examine the competencies of students and housestaff; to characterize the level of interprofessional collaboration; and to assess the relationship of these various team attributes with hard clinical outcomes and resource utilization.

Compared with the broad array of tools represented in Valentine and colleagues’8 2015 review of teamwork survey instruments, we selected a smaller number of tools characterized by similar domain structures. This likely reflects the greater number of team attributes that may be accessible in written surveys compared with direct observation, as written surveys allow for self-assessments of subjects’ perceptions, attitudes, and experiences. A minority of surveys in Valentine and colleagues’ study satisfied standard psychometric criteria, and fewer still had statistically significant associations with clinical outcomes. A systematic review of 73 instruments to assess teamwork in internal medicine reported that the majority of tools relied on team members’ self-assessments: Only 5 used “objective” measures to assess performance on a medicine service, and patient outcomes were examined directly in only 13 cases.27 There is little compelling evidence that the portfolio of existing teamwork observation tools is sufficient to support the clinical care, teaching, and research opportunities inherent in team-based care across practice settings.

Of the nine instruments used in our study, the observer found Shrader and colleagues’16 IPTE to be the easiest to score and best suited to assessing team performance on an inpatient medicine service. That said, it may be unrealistic and perhaps undesirable to select a single, best tool. The IPTE, despite its virtues, lacked items found in other instruments related to the early identification of safety threats and conflict mitigation. The key attributes of leadership or communication domains may vary depending on the clinical setting. An instrument that captures teamwork in an operating room or emergency department may be poorly suited to assessing teamwork on a medical service or in a labor and delivery suite. In fact, different tools may offer complementary information about a single team’s performance under varying conditions.

This study was limited by the small number of direct observation sessions performed. However, we collected 161 observations per team, or over 3,200 observations in total. The use of a single observer mitigated the risk of interobserver variation but allowed for the possibility of systematic observer bias. A single observer using a consistent scoring methodology should help to reduce the variability that we found across instruments. However, it is possible that the ratings may quantify an observer’s biased assessments; interrater reliability testing is needed in future studies to assess potential confounding. While the observer sought to apply the various tools using the method outlined by their developers, the ability to do so was limited by the availability of training materials and the need to adapt existing tools to a general medicine service. Teamwork during medicine work rounds differs in pace, tasks, and environment from other teamwork settings such as an emergency department or code team, requiring adaptation of the instrument. Some behaviors may be more difficult to observe in this setting or require additional observations to identify subtle clues. Because there is no “gold standard” for measuring teamwork on the basis of direct observation, future studies are required to provide evidence in support of the construct validity and interrater reliability of proposed measurement tools and the relationship between behavioral observations and direct measures of patient safety.

In conclusion, existing, published direct observation tools for teamwork behavior of clinical teams, despite a common domain structure, yielded inconsistent results. Future research is needed to create better instruments for measuring teamwork performance.


1. CRICO Strategies. Malpractice risks in communication failures: 2015 annual benchmarking report. Published 2015. Accessed March 6, 2018.
2. Jain M, Miller L, Belt D, King D, Berwick DM. Decline in ICU adverse events, nosocomial infections and cost through a quality improvement initiative focusing on teamwork and culture change. Qual Saf Health Care. 2006;15:235–239.
3. Neily J, Mills PD, Young-Xu Y, et al. Association between implementation of a medical team training program and surgical mortality. JAMA. 2010;304:1693–1700.
4. Wiener EL, Kanki BG, Helmreich RL. Cockpit Resource Management. San Diego, CA: Academic Press; 1993.
5. Dietz AS, Pronovost PJ, Benson KN, et al. A systematic review of behavioural marker systems in healthcare: What do we know about their attributes, validity and application? BMJ Qual Saf. 2014;23:1031–1039.
6. Thomas EJ, Sexton JB, Helmreich RL. Translating teamwork behaviours from aviation to healthcare: Development of behavioural markers for neonatal resuscitation. Qual Saf Health Care. 2004;13(suppl 1):i57–i64.
7. Huang LC, Conley D, Lipsitz S, et al. The Surgical Safety Checklist and Teamwork Coaching Tools: A study of inter-rater reliability. BMJ Qual Saf. 2014;23:639–650.
8. Valentine MA, Nembhard IM, Edmondson AC. Measuring teamwork in health care settings: A review of survey instruments. Med Care. 2015;53:e16–e30.
9. Risser DT, Rice MM, Salisbury ML, Simon R, Jay GD, Berns SD. The potential for improved teamwork to reduce medical errors in the emergency department. The MedTeams Research Consortium. Ann Emerg Med. 1999;34:373–383.
10. Nielsen PE, Goldman MB, Mann S, et al. Effects of teamwork training on adverse outcomes and process of care in labor and delivery: A randomized controlled trial. Obstet Gynecol. 2007;109:48–55.
11. Morey JC, Simon R, Jay GD, et al. Error reduction and performance improvement in the emergency department through formal teamwork training: Evaluation results of the MedTeams project. Health Serv Res. 2002;37:1553–1581.
12. Weaver SJ, Lyons R, DiazGranados D, et al. The anatomy of health care team training and the state of practice: A critical review. Acad Med. 2010;85:1746–1760.
13. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837.
14. Sigalet E, Donnon T, Cheng A, et al. Development of a team performance scale to assess undergraduate health professionals. Acad Med. 2013;88:989–996.
15. Kiesewetter J, Fischer MR. The Teamwork Assessment Scale: A novel instrument to assess quality of undergraduate medical students’ teamwork using the example of simulation-based ward-rounds. GMS Z Med Ausbild. 2015;32:Doc19.
16. Shrader S, Kern D, Zoller J, Blue A. Interprofessional teamwork skills as predictors of clinical outcomes in a simulated healthcare setting. J Allied Health. 2013;42:e1–e6.
17. Keebler JR, Dietz AS, Lazzara EH, et al. Validation of a teamwork perceptions measure to increase patient safety. BMJ Qual Saf. 2014;23:718–726.
18. Chiu CJ, Brock D, Abu-Rish E. Performance Assessment Communication and Teamwork (PACT) tool set. Published 2014. Accessed March 6, 2018.
19. Guise JM, Deering SH, Kanki BG, et al. Validation of a tool to measure and promote clinical teamwork. Simul Healthc. 2008;3:217–223.
20. Malec JF, Torsher LC, Dunn WF, et al. The Mayo high performance teamwork scale: Reliability and validity for evaluating key crew resource management skills. Simul Healthc. 2007;2:4–10.
21. Zhang C, Miller C, Volkman K, Meza J, Jones K. Evaluation of the team performance observation tool with targeted behavioral markers in simulation-based interprofessional education. J Interprof Care. 2015;29:202–208.
22. Frankel A, Gardner R, Maynard L, Kelly A. Using the Communication and Teamwork Skills (CATS) Assessment to measure health care team performance. Jt Comm J Qual Patient Saf. 2007;33:549–558.
23. Sexton JB, Berenholtz SM, Goeschel CA, et al. Assessing and improving safety climate in a large cohort of intensive care units. Crit Care Med. 2011;39:934–939.
24. Baker DP, Gustafson S, Beaubien J, Salas E, Barach P. Medical Teamwork and Patient Safety: The Evidence-Based Relation. Rockville, MD: Agency for Healthcare Research and Quality; April 2005. AHRQ publication no. 05-0053. Accessed March 6, 2018.
25. Farley DO, Sorbero ME, Lovejoy SL, Salisbury M. Achieving Strong Teamwork Practices in Hospital Labor and Delivery Units. Santa Monica, CA: RAND; 2010.
26. Shannon DW. How a captive insurer uses data and incentives to advance patient safety. Patient Saf Qual Healthc. November/December 2009. Accessed March 7, 2018.
27. Havyer RD, Wingo MT, Comfere NI, et al. Teamwork assessment in internal medicine: A systematic review of validity evidence and outcomes. J Gen Intern Med. 2014;29:894–910.

Supplemental Digital Content

Copyright © 2018 by the Association of American Medical Colleges