Competency in bedside procedures is an expectation for graduates of residency training programs in many specialties,1–4 but procedural competence is not always easily defined. With the advent of simulation models for bedside procedures, assessment of technical skills is increasingly possible and is commonly done using either checklists or global rating scales.5–8
Defining competence as a dichotomous variable poses difficulty for medical educators, especially when assessing trainees’ technical skills using checklists that produce outcome measures that are continuous in nature. Further, setting cut scores to divide performances into those deemed competent versus not competent may seem arbitrary and subjective.9 In response, multiple methods for standard setting have been developed in an attempt to establish reasonable and defensible standards for defining competence.10 Achievement of competency in bedside procedures during residency is a critical aspect of patient safety. Indeed, one study showed that training residents to achieve minimum passing scores on central line insertions had a positive impact on patient outcomes.11
Although checklists are relatively easy and intuitive to use in assessing skills, they have not demonstrated psychometric properties superior to those of global rating scales. Rather, much of the work on clinical skills assessment has shown that global rating scales demonstrate superior reliability and validity measures, as well as better sensitivity to levels of expertise, compared with checklists.12–14
In our previous work15 comparing the use of two checklists with a global rating scale in the assessment of trainees’ central venous catheterization skills, we identified the limitations of checklists in the diagnosis of technical competence. Specifically, we found that a high checklist score did not rule out incompetence. What was not clear was whether this finding would hold true for additional bedside procedures.
We hypothesized that checklists demonstrate low specificity in the diagnosis of competence and that this finding would hold true across procedures. In this study, we further evaluated the use of procedure-specific checklists versus a global rating scale in the assessment of technical competence in trainee performances of six additional bedside procedures of varying complexity. Our findings may assist educators in deciding which tool may be preferable for assessment purposes and whether our results are generalizable across procedures.
Simulation-based procedural performances
This study was approved by the Conjoint Health Research Ethics Board (CHREB) at the University of Calgary. In 2011, all 56 University of Calgary internal medicine residents (postgraduate years [PGYs] 1–3) were invited to participate in a formative, simulation-based objective structured clinical examination (OSCE), scheduled over six half-days between July and September. The OSCE consisted of six bedside procedures: arterial blood gas sampling (ABG), knee arthrocentesis, intubation, lumbar puncture (LP), ultrasound-guided (UG) paracentesis, and UG thoracentesis. On each half-day, examinations on three to four procedures were conducted in two parallel tracks. All performances were video-recorded. Participating residents were encouraged, but not required, to perform all six procedures; however, residents who were unavailable on a given half-day (e.g., vacation, postcall) were not rescheduled. At the time of the examination, participants were invited to complete a survey collecting demographic and training-level data. Direct linkage of survey data to data from video recordings was performed only for participants who provided written consent. However, permission was granted by the CHREB for analysis of all video-recorded performances, regardless of consent to link with survey data.
Two trained raters evaluated each video-recorded performance using both a procedure-specific checklist and a global rating scale, over a period of five months in 2012. These raters were an internist (I.M.) with 10 years of experience in teaching and assessing procedural skills and a senior medical trainee (A.W.) in her last year of training who had been a certified procedural trainer16 for the residency program for two years. The raters were trained to consensus on the use of the assessment tools for a minimum of two hours for each procedure; this training used a random sample of three to five video-recorded performances per procedure. In assessing performances, the order in which the tools were used was alternated with each assessment to minimize the extent to which a rating on one tool might systematically influence a rating on the other tool.
Raters were blinded to the minimum passing scores on the checklists, and they made pass/fail decisions for each performance on the basis of the overall global assessment rating on the global rating scale (described below). Ratings for this study were performed for research purposes only and posed no consequences for the participants.
Global rating scale
We previously15 developed the global rating scale used in this study by combining relevant items from two published rating scales: the Direct Observation of Procedural Skills (DOPS)17 and the Objective Structured Assessment of Technical Skills (OSATS).7 The resultant tool (see Appendix 1) includes eight items that assess technical competence in the following domains: preprocedure preparation of instruments, analgesia, time and motion, instrument handling, procedural flow and forward planning, knowledge of instruments, aseptic technique, and help seeking. The four items derived from the DOPS are rated on a six-point scale, whereas the four items derived from the OSATS are rated on a five-point scale.
The global rating scale also includes a summary item on “overall ability to perform procedure,” which is rated on a six-point scale from “not competent to perform independently” to “above average competence to perform independently.” The raters were asked to rate this summary item using a “global impression”; we did not specify that they should use the ratings on the eight domains to inform this overall global assessment.
Raters were also asked to record comments relating to “red flag” performances (i.e., those they deemed not competent) in an open-ended text response format.
Procedure-specific checklists and standard setting
We created the six procedure-specific checklists based on review of the literature and task analysis. Each checklist was evaluated by a panel of a minimum of six experts from at least two academic health centers (University of Calgary and University of British Columbia). Six experts participated on the expert panels for the following five procedures: ABG (one respirologist, one intensivist, one emergency physician, one anesthesiologist, one respiratory therapist, one internist); knee arthrocentesis (three rheumatologists, one emergency physician, one internist, one orthopedic surgeon); intubation (one respirologist, one intensivist, one respiratory therapist, one emergency physician, two anesthesiologists); LP (one neurologist, one emergency physician, one internist, two hematologists, one anesthesiologist); and UG thoracentesis (two respirologists, one internist, one emergency physician, one intensivist, one general surgeon). Seven experts participated on the expert panel for UG paracentesis (three gastroenterologists, three internists, one general surgeon).
For each checklist, two or more rounds of surveys were conducted between December 2010 and October 2011 to reach consensus in a modified Delphi approach.18 In the first-round survey, we asked the experts to rate the importance of each checklist item on a scale of 1 to 5 (very unimportant to very important) and to indicate whether it should be included in the checklist. We also asked the experts if additional steps should be included in the checklist and to provide additional comments. We revised each checklist on the basis of expert panel input.
In the second-round survey, we asked the experts to comment on the revised checklist items and to rate the difficulty of each checklist item (easy, medium, or hard) and the importance of the item to the overall procedure (marginal, important, or essential). We also asked experts to estimate for each item what percentage of “borderline” trainees would be expected to perform the step correctly. We incorporated this information in the process of setting standards (i.e., minimum passing levels) using the Angoff19 and Ebel20 methods.
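The two standard-setting calculations can be sketched in code; the expert estimates, item classifications, and cell weights below are hypothetical illustrations, not the panel's actual data.

```python
# Illustrative sketch of the Angoff and Ebel minimum-passing-level
# (MPL) calculations. All numbers below are hypothetical.

def angoff_mpl(expert_estimates):
    """Angoff method: each expert estimates, per item, the proportion
    of 'borderline' trainees expected to perform the step correctly.
    The MPL is the mean expected score across experts, as a percentage."""
    per_expert = [sum(est) / len(est) for est in expert_estimates]
    return 100 * sum(per_expert) / len(per_expert)

def ebel_mpl(item_cells, cell_weights):
    """Ebel method: each item is classified by difficulty and
    importance; each (difficulty, importance) cell carries an expected
    proportion correct for borderline trainees. The MPL is the average
    expected proportion across items, as a percentage."""
    expected = [cell_weights[cell] for cell in item_cells]
    return 100 * sum(expected) / len(expected)

# Hypothetical ratings for a 4-item checklist from a 2-expert panel
estimates = [
    [0.90, 0.70, 0.80, 0.60],  # expert 1, items 1-4
    [0.80, 0.60, 0.90, 0.70],  # expert 2, items 1-4
]
cells = [("easy", "essential"), ("medium", "important"),
         ("easy", "important"), ("hard", "marginal")]
weights = {("easy", "essential"): 0.90, ("medium", "important"): 0.70,
           ("easy", "important"): 0.80, ("hard", "marginal"): 0.40}

print(round(angoff_mpl(estimates), 1))     # 75.0
print(round(ebel_mpl(cells, weights), 1))  # 70.0
```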
The final checklists range in length from 14 to 22 items, with each item representing a procedural step (for the ABG checklist, see Appendix 2; for the other checklists, see Supplemental Digital Appendix 1 at http://links.lww.com/ACADMED/A272). Each item is rated on a three-point scale (no = 0; yes, but = 1; and yes = 2). Items completed correctly are rated “yes,” items not completed are rated “no,” and items completed incorrectly are rated “yes, but.” During data analysis for this study, item ratings were analyzed in a binary fashion. That is, ratings of “no” and “yes, but” were considered as “no” and given zero points, whereas ratings of “yes” were given one point. Overall checklist scores, presented as percentages, were calculated using this binary system in which total points were divided by the number of checklist items.
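The binary scoring described above reduces to a simple calculation; the five-item set of ratings below is a hypothetical example.

```python
# Sketch of the binary analysis of checklist item ratings:
# "yes" = 1 point; "no" and "yes, but" = 0 points.
# The ratings below are hypothetical.

def checklist_score(ratings):
    """Return the overall checklist score as a percentage: total
    points divided by the number of checklist items."""
    points = sum(1 for r in ratings if r == "yes")
    return 100 * points / len(ratings)

ratings = ["yes", "yes, but", "no", "yes", "yes"]  # a 5-item example
print(checklist_score(ratings))  # 60.0
```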
Diagnosis of competence
Performances with overall global assessment ratings of “competent to perform the procedure independently” or higher were rated competent (“pass”), whereas performances with ratings of “borderline competence to perform independently” or lower were rated not competent (“fail”). Disagreements between the two raters regarding competence were resolved by discussion until they reached consensus and, where necessary, by seeking the opinion of a third reviewer.
Three measurements of checklist scores were then compared with this overall diagnosis of competence: the performance’s overall checklist score and two minimum passing levels, set by the Angoff19 and Ebel20 methods.
Internal reliability was evaluated using Cronbach’s alpha. Correlation between checklist scores and overall global assessment scores was evaluated using the Pearson correlation coefficient. Interrater reliability was assessed using the intraclass correlation coefficient and Cohen’s Kappa where appropriate. Validity evidence was evaluated by linear regression analyses with participant training year as an independent variable and performance scores (checklist scores and overall global assessment scores) as the dependent variable.
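As one illustration of these analyses, Cronbach's alpha can be computed directly from per-item scores; the item-score matrix below is hypothetical, not study data.

```python
# Minimal sketch of Cronbach's alpha from per-item checklist scores.
# The item-score matrix below is hypothetical.

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(items):
    """items: one inner list per checklist item, each aligned across
    the same set of performances."""
    k = len(items)
    n = len(items[0])
    total_scores = [sum(item[j] for item in items) for j in range(n)]
    sum_item_var = sum(sample_variance(item) for item in items)
    return (k / (k - 1)) * (1 - sum_item_var / sample_variance(total_scores))

# Three items rated across four performances (hypothetical)
items = [[1, 0, 1, 1],
         [1, 0, 1, 0],
         [1, 1, 1, 0]]
print(round(cronbach_alpha(items), 2))  # 0.56
```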
The sensitivity and specificity of the overall checklist scores for each procedure, compared with the dichotomous measure of competence on the global rating scale, were evaluated at various checklist cut scores with receiver operating characteristic (ROC) analyses. For each procedure, all possible checklist cut scores from 0% to 100% were considered in an incremental manner. At each possible cut score, performances were classified as “competent” above the cut score and “not competent” below the cut score. The sensitivity and specificity of each cut score’s diagnosis of competence, compared with the classification based on the global rating scale, were then calculated. The area under the curve (AUC) was estimated and used as a performance measure of diagnostic accuracy, where an AUC of 1.0 indicates perfect diagnostic accuracy.21
A two-sided P value of < .05 was considered to indicate statistical significance. All analyses were performed using PASW Statistics version 18.0 (PASW, IBM Corporation, Somers, New York) and Stata 11.0 (StataCorp LP, College Station, Texas).
Forty-seven (83.9%) of the 56 internal medicine residents participated, completing a total of 218 performances (ABG, n = 33; knee arthrocentesis, n = 37; intubation, n = 34; LP, n = 34; UG paracentesis, n = 39; and UG thoracentesis, n = 41). Of the 47 participants, 22 (46.8%) were PGY-3 trainees and 25 (53.2%) were PGY-1 and PGY-2 trainees.
Ninety-one (42%) of the 218 performances were deemed competent based on the global assessment item on the global rating scale. For each procedure, the number (percentage) of participants whose performances were rated competent was as follows: ABG, 18 (55%); knee arthrocentesis, 18 (49%); intubation, 17 (50%); LP, 9 (26%); UG paracentesis, 24 (62%); and UG thoracentesis, 5 (12%).
The global rating scale demonstrated higher internal reliability than the procedure-specific checklist for each procedure (Table 1). Overall checklist scores demonstrated strong correlation with overall global assessment scores from the global rating scale,22 ranging from r = 0.59 for ABG to r = 0.81 for intubation.
Interrater reliability was consistently higher for the checklists than the global rating scale (Table 2). (For procedure-specific data on interrater reliability of individual checklist and global rating scale items, see Supplemental Digital Appendix 2 at http://links.lww.com/ACADMED/A272). However, interrater reliability for decisions on each performance’s competence using the overall global assessment rating was almost perfect.23
Table 3 presents the mean overall checklist and overall global assessment scores by year of training for the 41 participants who consented to the use of their survey data. In general, scores increased with years of training except for LP, knee arthrocentesis, and UG thoracentesis.
Diagnosis of competence using checklists
In ROC analyses, using the overall checklist score to diagnose competence demonstrated acceptable discrimination: The AUC ranged from 0.84 (95% CI 0.72–0.97) for UG paracentesis to 0.93 (95% CI 0.82–1.00) for UG thoracentesis (Table 4). To maximize the sensitivity of the diagnosis of competence to 100% (i.e., the lowest checklist score above which all competent performances—as determined on the basis of the overall global assessment rating—were also rated as competent on the checklist), the cut scores ranged from 58.3% for intubation to 85.7% for LP. In other words, all competent performances scored at least 58.3% for intubation and at least 85.7% for LP.
To maximize the specificity for the checklist’s diagnosis of competence to 100%, the cut score ranged from 76.5% for intubation to 100% for knee arthrocentesis. In other words, a performance with an overall checklist score of 99% in knee arthrocentesis could still be deemed not competent based on the overall global assessment rating.
The minimum passing levels set by the Angoff19 and Ebel20 methods were similar to the cut scores set to maximize sensitivity. However, the minimum passing levels set by these methods were consistently lower than the cut scores set to maximize specificity (Table 4). In general, minimum passing levels demonstrated high sensitivity but poor specificity in the diagnosis of competence.
Appropriateness of pass/fail decisions using global assessments
Performances that were deemed not competent (“fail”) based on the overall global assessment rating appeared justifiably rated, based on our analysis of the written comments from the two raters. For the 127 performances rated as not competent, the raters cited a mean (standard deviation) of 1.5 (0.6) reasons for the rating. The most commonly cited reasons included clinically significant procedural errors such as significant breaches of sterility (n = 67; 52.8% of not competent performances) and laceration or damage of important structures or organs (n = 45; 35.4%). Other reasons included unsuccessful or too many attempts (n = 35; 27.6%), significant flow or knowledge issues (n = 25; 19.7%), and insufficient preparation (n = 16; 12.6%).
In this study evaluating the assessment of residents’ technical competence in six bedside procedures, we found that a global rating scale demonstrated higher internal reliability for rating performances of each procedure than did procedure-specific checklists. Second, raters’ agreement on technical competence, based on the overall global assessment rating, was almost perfect. Third, for the checklists, the minimum passing levels set by traditional standard-setting methods were similar to the cut scores set to maximize the sensitivity in the diagnosis of competence, at the expense of specificity. That is, minimum passing levels were indeed the minimum competence levels; receiving a high score on a procedure’s checklist by no means ruled out incompetence.
Although checklists have the advantages of being easy to use and of being presumably more objective than global rating scales,24 mounting evidence has raised concerns regarding the use of checklists for assessment purposes.25,26 Consistent with existing literature,12,13,27 our global rating scale demonstrated higher internal reliability than our checklists. Others have previously shown that, when administered by experts, global rating scales demonstrated higher sensitivity to levels of training12,14,28 and higher interstation reliability and generalizability12,13,29 than checklists. With this in mind, we believe that assessment using a global rating scale may be superior to assessment using a checklist for the evaluation of technical skills competence.
In the era of competency-based education,30,31 standard setting is an important part of training.32 Although popular standard-setting methods exist, many employ cut scores. As such, setting a cut score too low will increase the probability of false positives in the diagnosis of competence,33 erroneously labeling incompetent performances as competent. On the other hand, setting a cut score too high will increase the probability of false negatives, misidentifying competent performances as incompetent. Furthermore, even employing systematic approaches to establishing cut scores will not necessarily eliminate the subjectivity of the process.10
Pass/fail decisions embody value judgments, with both technical and empirical considerations.33 The checklist cut scores established by our expert panel, using two traditional standard-setting methods, consistently favored minimizing false negatives (high sensitivity) at the expense of increasing false positives (low specificity). Although these standard-setting approaches may be appropriate for high-stakes examinations, they are less desirable for formative evaluations in which educators may wish to maximize the probability of identifying individuals who require additional training.
How can standard setting be improved? One option is for educators to take an iterative approach to standard setting. Further, experts involved in the standard-setting process should be familiar with the purposes of the assessment. This iterative approach may result in increasingly stringent passing standards over time, which may then increase the specificity of the diagnosis of competence.32 A second option may be to explore further the expert value judgments used in establishing standards. A third alternative is to revise and reconsider the type of items that should be included in the assessment tools. Overall, our raters seemed to deem global assessments of performances as not competent (i.e., failing) primarily on the basis of significant procedural errors that they felt affected patient safety, such as breaches of sterility and likelihood of patient injury (e.g., pneumothorax, lung puncture, liver laceration). Although these decisions were largely qualitative, they seem justifiable retrospectively. However, the defensibility of these decisions should be further explored with a larger panel of expert raters to gain a better understanding of the role of clinical errors in standard setting.
Limitations and strengths
Our study has a number of limitations. First, this is a single-center study, which limits generalizability. However, our results are congruent with those of our previous study on assessment of central venous catheterization skills at another institution, which may suggest some degree of generalizability.15 Nonetheless, our results may not generalize beyond postgraduate internal medicine trainees. Second, we assessed performance on simulators. Thus, our results may not translate to assessments of procedures performed on the ward.
Third, for the purposes of our assessments, we primarily evaluated technical skills and did not evaluate communication skills, teamwork skills, or professionalism. Fourth, our raters were trained. The psychometric properties of the tools we developed may differ dramatically in the hands of untrained raters. As trained assessors may not be readily available in other settings, it may not be feasible to rely solely on global rating scales for the assessment of technical competence by direct observation. Under such circumstances, we believe there remains a role for procedure-specific checklists, as these may be easier for untrained raters to use. Rating video-recorded performances, however, may be an option for circumventing issues regarding limited faculty availability.
Fifth, scores did not uniformly increase with increasing training level, a finding that was most prominent for LP, knee arthrocentesis, and UG thoracentesis. One possible explanation is that all three procedures were taught and assessed in the final year of medical school at the University of Calgary, and many of our residents are graduates of our medical school. It is therefore possible that these skills deteriorated over the course of residency rather than improved.
Lastly, our study is limited in scope in that our focus was on defining basic technical competence rather than identifying expertise. To achieve true expertise, significant time spent on deliberate practice and clinical experience would be needed.34–36 Although defining expertise was beyond the scope of this study, we wish to emphasize that the journey toward mastering technical skills should not end with the achievement of competence alone.
Despite these limitations, our study has a number of strengths. First, we evaluated performances of six procedures of varying complexity. The uniformity of our results suggests some degree of generalizability of our findings. Second, our checklists were constructed in a systematic fashion, with input from experts in various specialties, which contributes to content validity.
In conclusion, our study demonstrated that for the assessment of technical skills in bedside procedures, the use of a global rating scale consistently yielded higher internal reliability than the use of procedure-specific checklists. Further, overall diagnosis of competence based on a global impression demonstrated excellent interrater reliability and seemed justifiable retrospectively based on raters’ written comments. Lastly, high checklist scores on a procedure did not rule out incompetence. Future research should explore the role of procedural errors in standard setting. On the basis of the results of this study, we recommend the inclusion of a global rating scale in the assessment of procedural skills.
Acknowledgments: The authors wish to thank the experts who served on the expert panels and the participants of this study.
4. Royal College of Anaesthetists. The CCT in anaesthetics. II: Competency based basic level (specialty training (ST) years 1 and 2). Training and Assessment: A Manual for Trainees and Trainers. 1st ed. 2007. http://www.rcoa.ac.uk/system/files/TRG-CU-2007PartII.pdf. Accessed January 20, 2015
5. Evans LV, Dodge KL. Simulation and patient safety: Evaluative checklists for central venous catheter insertion. Qual Saf Health Care. 2010;19(suppl 3):i42–i46
6. McKinley RK, Strand J, Ward L, Gray T, Alun-Jones T, Miller H. Checklists for assessment and certification of clinical procedural skills omit essential competencies: A systematic review. Med Educ. 2008;42:338–349
7. Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative “bench station” examination. Am J Surg. 1997;173:226–230
8. Ma IW, Sharma N, Brindle ME, Caird J, McLaughlin K. Measuring competence in central venous catheterization: A systematic review. Springerplus. 2014;3:33
9. Glass GV. Standards and criteria. J Educ Meas. 1978;15:237–261
10. Shepard L. Standard setting issues and methods. Appl Psych Meas. 1980;4:447–467
11. Barsuk JH, Cohen ER, Feinglass J, McGaghie WC, Wayne DB. Use of simulation-based education to reduce catheter-related bloodstream infections. Arch Intern Med. 2009;169:1420–1423
12. Hodges B, McIlroy JH. Analytic global OSCE ratings are sensitive to level of training. Med Educ. 2003;37:1012–1016
13. Regehr G, MacRae H, Reznick RK, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med. 1998;73:993–997
14. Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. OSCE checklists do not capture increasing levels of expertise. Acad Med. 1999;74:1129–1134
15. Ma IW, Zalunardo N, Pachev G, et al. Comparing the use of global rating scale with checklists for the assessment of central venous catheterization skills using simulation. Adv Health Sci Educ Theory Pract. 2012;17:457–470
16. Ma IW, Chapelsky S, Bhavsar S, et al. Procedural certification program: Enhancing resident procedural teaching skills. Med Teach. 2013;35:524
18. Clayton MJ. Delphi: A technique to harness expert opinion for critical decision-making tasks in education. Educ Psychol (UK). 1997;17:373–386
19. Angoff W. Scales, Norms, and Equivalent Scores. 2nd ed. 1971. Washington, DC: American Council on Education
20. Ebel RL. Essentials of Educational Measurement. 1972. Oxford, England: Prentice-Hall
21. Faraggi D, Reiser B. Estimation of the area under the ROC curve. Stat Med. 2002;21:3093–3106
22. Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. 1988. New York, NY: Academic Press
23. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174
24. Cohen DS, Colliver JA, Robbs RS, Swartz MH. A large-scale study of the reliabilities of checklist scores and ratings of interpersonal and communication skills evaluated on a standardized-patient examination. Adv Health Sci Educ Theory Pract. 1996;1:209–213
25. Van der Vleuten CP, Norman GR, De Graaff E. Pitfalls in the pursuit of objectivity: Issues of reliability. Med Educ. 1991;25:110–118
26. Norman GR, Van der Vleuten CP, De Graaff E. Pitfalls in the pursuit of objectivity: Issues of validity, efficiency and acceptability. Med Educ. 1991;25:119–126
27. Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84:273–278
28. Gerard JM, Kessler DO, Braun C, Mehta R, Scalzo AJ, Auerbach M. Validation of global rating scale and checklist instruments for the infant lumbar puncture procedure. Simul Healthc. 2013;8:148–154
29. Regehr G, Freeman R, Hodges B, Russell L. Assessing the generalizability of OSCE measures across content domains. Acad Med. 1999;74:1320–1322
30. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Mastery learning for health professionals using technology-enhanced simulation: A systematic review and meta-analysis. Acad Med. 2013;88:1178–1186
31. McGaghie WC, Issenberg SB, Cohen ER, Barsuk JH, Wayne DB. Medical education featuring mastery learning with deliberate practice can lead to better health for individuals and populations. Acad Med. 2011;86:e8–e9
32. Cohen ER, Barsuk JH, McGaghie WC, Wayne DB. Raising the bar: Reassessing standards for procedural competence. Teach Learn Med. 2013;25:6–9
33. American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing. 2014. Washington, DC: American Educational Research Association
34. Ericsson KA, Krampe RT, Tesch-Römer C. The role of deliberate practice in the acquisition of expert performance. Psychol Rev. 1993;100:363–406
35. Ericsson KA. Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Acad Med. 2004;79(10 suppl):S70–S81
36. Sznajder JI, Zveibil FR, Bitterman H, Weiner P, Bursztein S. Central vein catheterization. Failure and complication rates by three percutaneous approaches. Arch Intern Med. 1986;146:259–261