The assessment of medical competence at various levels of training continues to be an area of growing interest. The shift toward broader, programmatic assessment has revolutionized the approaches that many take in assessing trainees’ competence. “Programmatic assessment” acknowledges the limits of human judgment and combines information from various assessment measures to enhance the quality of competency-based assessments.1 According to Schuwirth and van der Vleuten, assessment of medical competence should be fair, transparent, and defensible.2 As many medical schools and external organizations strive to develop assessment programs that meet these criteria, there is an ongoing need to better understand the relationships that exist between various quantitative and qualitative indicators of competence.
Previous research has identified some of the pitfalls of relying solely on a quantitative psychometric framework; these include a lack of reliability or accuracy, rater bias, and assessment information that is lost or distorted in translation.3–5 As studies continue to highlight shortcomings of quantitative approaches for rater-mediated assessments,6 there exists a greater push toward the inclusion or use of qualitative data such as performance reports and narrative comments. Existing literature examining written comments indicates that narratives correlate with assigned grades,7 can be used to assess learning and rank residents,8 and provide a more holistic impression of residents’ competence.9
Although there has been a great deal of support for the use of narrative comments within assessment programs,10 qualitative measures are not without challenges. Several studies highlight the generally poor quality of written comments.11–13 Furthermore, some individuals are reluctant to provide written comments,12 especially when the information is shared with the learner, often because they anticipate that their comments will be challenged or appealed.14
In addition to issues related to the quality of comments, individuals often conflate qualitative assessments (e.g., written comments) with feedback, which is problematic because assessment and feedback can have very different purposes. Assessment is typically viewed as a summative activity used to report on the learning that has occurred while also fulfilling an accountability role. Feedback is seen as formative as it technically requires two-way communication to support learners through future levels of training. The dialogical aspect of feedback is crucial because without a conversation between the assessor and learner, direct observations are merely assessment data.
In an effort both to satisfy assessment requirements and to provide valuable feedback to learners,15 assessors have developed methods of communicating using a so-called “hidden code.”10 Assessors have been using this hidden code to satisfy reporting and accountability requirements without negatively affecting learners. A hidden code occurs when assessors purposefully omit elements and use vague language to describe a learner’s performance.16 For example, the comment “Good resident, improving well” is an important signal that the assessor is negatively interpreting the ability of this resident. Comparatively, narrative comments such as “I would hire him,” “Excellent shift—saw many complex patients,” or “Ready to be a senior resident; managed a difficult airway efficiently and impressively” all signal a positive interpretation of a resident’s ability. The difficulty with using a hidden code is that those unfamiliar with the language and context may overlook or misinterpret assessors’ intended meanings. Furthermore, this discourse can make the information obtained from quantitative and qualitative assessments seem misaligned.
To capitalize on assessment data collected from assessors, the medical education community needs a way to successfully integrate quantitative and qualitative assessment data. Current research investigating associations between quantitative and qualitative measures of medical competence is essentially nonexistent. One explanation for this gap relates to the discourse surrounding competency-based medical education. A false dichotomy of quality versus quantity has emerged,17 leading to an epistemological divide that contributes to the lack of studies examining associations between quantitative and qualitative measures. In an attempt to close this gap, we have explored the relationships that exist between assessors’ quantitative and qualitative workplace-based assessments (WBAs) of medical competence. Understanding these relationships is crucial for developing robust assessment programs that yield defensible decisions regarding a trainee’s progression.
Specifically, we hypothesized that relationships exist between different quantitative and qualitative assessments (e.g., checklist scores and written comments), which are not currently being explored. For instance, we believed the selection of a “done, but needs attention” checklist option would prompt assessors to provide qualitative comments to supplement and explain what went well and, importantly, what requires further attention. Therefore, we addressed the following research questions: (1) What are the relationships among checklist scores, task ratings, global ratings, and written comments? and (2) How are checklist scores and global ratings associated with balanced narrative comments (i.e., quality comments that address both strengths and areas of improvement)? As knowledge and understanding of associations between different quantitative and qualitative data increase, medical educators can begin to purposefully tailor assessment programs to capture meaningful information both to support learners’ development and progression and to meet accountability requirements.
For the current study, we used data from postgraduate year (PGY)-1 and PGY-2 residents who trained from September 2012 to August 2014. These residents, part of McMaster University’s Emergency Medicine Postgraduate Program, were assessed using the McMaster Modular Assessment Program (McMAP).18 Further details about McMAP (and access to the assessment tool) are available from Chan and Sherbino.18 McMAP comprises various WBAs, all blueprinted to the CanMEDS framework.19 During postgraduate training, faculty members or senior residents (n = 68) completed WBAs specifically developed for each CanMEDS role (i.e., medical expert, communicator, collaborator, professional, scholar, manager, and health advocate) to assess PGY-1 and PGY-2 residents’ readiness for unsupervised practice. Each assessment task entailed marking a checklist, providing task and global performance ratings, and writing narrative comments. The Hamilton Integrated Research Ethics Board reviewed this study and deemed it exempt from requiring ethics approval.
Instruments and procedure
As previously mentioned, McMAP comprises various WBA tools blueprinted to the CanMEDS roles. Each assessment task consists of five parts:
- A task checklist with four score options (i.e., “done,” “done, but needs attention,” “not done,” and “not applicable for case”);
- A task rating instrument for which assessors provide a rating for each specific task;
- A task comment box in which the assessor is required to provide narrative comments to support the task rating and offer next steps;
- A global rating designed to capture a resident’s holistic performance throughout a shift; and
- A global comment box in which the assessor can comment on a resident’s overall performance in order to prompt a formative dialogue.
At particular milestones throughout residents’ training, these WBAs are compiled and evaluated to assess and monitor the readiness of postgraduates for unsupervised practice.
For this study, we used McMAP assessment data from only the communicator and collaborator roles as these were the only two roles that yielded sufficient data to meet sample size requirements for regression analyses. One of us, an expert specializing in both education and emergency medicine (T.M.C.), assessed all of the written comments from the sample McMAP assessments. Using a modified version of the completed clinical evaluation report rating (CCERR),20 this expert evaluated both the task and global comments using a five-point rating scale, which was anchored at three points (i.e., 1 = “Not at all,” 3 = “Acceptable,” and 5 = “Exemplary”).
Regression analyses (quantitative)
To examine the relationship among checklist scores, task ratings, global ratings, and narrative comments, we conducted logistic regression analyses on data from all the McMAP assessments in the sample using R, version 3.1 (Open Source). We constructed the main logistic regression model to determine which components of the task or global ratings and narrative comments were associated with the “done, but needs attention” checklist option. This model explored the following five variables:
- Task ratings (as assessed by the faculty member supervising the resident’s shift);
- Global rating (also as assessed by the faculty member supervising the resident’s shift);
- The quality of the resident’s strengths related to a particular task (as conveyed by the faculty member and described in the modified CCERR20 [see Supplemental Digital Appendix 1 at https://links.lww.com/ACADMED/A451]);
- The quality of the resident’s weaknesses related to a particular task (as conveyed by the faculty member and described in the modified CCERR20); and
- The length (i.e., number of words) of the comment written about the task.
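As a rough illustration of what a model of this kind estimates, the sketch below fits a single-predictor logistic regression by gradient ascent in plain Python. It is not the authors' analysis: they fitted all five variables in R, whereas this example uses only a task rating. The 1-to-7 rating scale, the toy data, the learning rate, and the function names are all invented for illustration.

```python
import math

def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, learning_rate=0.1, steps=5000):
    """Fit P(y = 1) = sigmoid(b0 + b1 * (x - mean(x))) by gradient ascent."""
    n = len(x)
    mean_x = sum(x) / n
    centered = [xi - mean_x for xi in x]
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        # Gradient of the mean log-likelihood with respect to b0 and b1.
        errors = [yi - sigmoid(b0 + b1 * ci) for ci, yi in zip(centered, y)]
        b0 += learning_rate * sum(errors) / n
        b1 += learning_rate * sum(e * c for e, c in zip(errors, centered)) / n
    return b0, b1, mean_x

# Hypothetical toy data (NOT from the study): a task rating on an assumed
# 1-7 scale, and whether "done, but needs attention" was selected (1) or not (0).
task_ratings = [2, 3, 3, 4, 4, 5, 5, 6, 6, 7]
needs_attention = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

intercept, slope, center = fit_logistic(task_ratings, needs_attention)

def predicted_probability(rating):
    """Predicted probability that the option is selected at a given rating."""
    return sigmoid(intercept + slope * (rating - center))
```

With these invented data the fitted slope is negative, meaning the modelled probability of selecting the option falls as the task rating rises. A full analysis would add the remaining four predictors and report standard errors and Wald tests.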
To validate the results of the first model, we built an ordinal logistic regression model to further explore the relationship between “done, but needs attention” and residents’ overall strengths and weaknesses as assessed in the global comments. Using the ratings of the established expert (T.M.C.) on the modified CCERR,20 we evaluated comments according to the degree to which they balanced strengths and opportunities for improvement; we used these ratings as the dependent variable in this model. The two independent variables included in this model were (1) “done, but needs attention” and (2) the faculty member’s global rating of the resident. We calculated the “done, but needs attention” variable based on the proportion (i.e., percentage) of individual checklist items that were assessed as “done, but needs attention.” The global rating of each resident was provided by the faculty member based on his/her assessment of the resident’s performance throughout the shift. Prior to interpreting the results, we validated both regression models to ensure that they satisfied the underlying model assumptions.
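The "done, but needs attention" variable described above can be sketched as a simple percentage. This is an illustration only: the function name is hypothetical, and it assumes that every recorded checklist item (including any marked "not applicable for case") counts toward the denominator, which the text does not specify.

```python
NEEDS_ATTENTION = "done, but needs attention"

def needs_attention_percentage(checklist_items):
    """Percentage of checklist items marked 'done, but needs attention'.

    Assumption: all recorded items count toward the denominator.
    """
    if not checklist_items:
        raise ValueError("an assessment needs at least one checklist item")
    flagged = sum(1 for item in checklist_items if item == NEEDS_ATTENTION)
    return 100.0 * flagged / len(checklist_items)

# Hypothetical assessment with four checklist items, one of them flagged.
example_items = ["done", NEEDS_ATTENTION, "done", "not done"]
example_percentage = needs_attention_percentage(example_items)
```

For the example list, one of four items is flagged, so the function returns 25.0.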
Content analysis (qualitative)
Following the regression analysis, we collected and reviewed all task and global comments. We assigned the comments to one of two categories: “done” or “done, but needs attention.” We assigned a comment to the “done, but needs attention” category if the “done, but needs attention” checklist option was used at least once. We closely analyzed and coded the comments within these two categories using the positive and negative red flags identified by Ginsburg and colleagues10 to better understand the types of comments that typically accompany situations in which an assessor deemed that a resident either fully completed a task or completed a task but still required attention. Two of us (S.S.S. and D.A.K.) conducted this portion of the analyses, reaching consensus through discussion. The intention of the written comments is to provide information about the residents’ performance, which can be used to further inform decision makers (i.e., clinical competence committees) with specific details regarding each resident’s progress.
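The assignment rule for the two comment categories can be written as a short function. This is a sketch with hypothetical names. It assumes that any assessment in which the option was never selected falls into the "done" category, which is how the two-category description above reads.

```python
NEEDS_ATTENTION = "done, but needs attention"
DONE = "done"

def comment_category(checklist_items):
    """Category for an assessment's comments in the content analysis.

    A single use of the 'done, but needs attention' option is enough to
    place the comments in that category; otherwise they are treated as 'done'.
    """
    if any(item == NEEDS_ATTENTION for item in checklist_items):
        return NEEDS_ATTENTION
    return DONE

# Hypothetical checklists for two assessments.
flagged_once = comment_category(["done", "done", NEEDS_ATTENTION])
never_flagged = comment_category(["done", "done", "done"])
```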
The sample consisted of 342 independent McMAP assessments (158 PGY-1 assessments and 184 PGY-2 assessments) of 23 residents over a two-year period (i.e., each resident was assessed multiple times on seven different assessment tasks). We analyzed the McMAP assessments, which were designed to evaluate the communicator and collaborator CanMEDS roles of residents, using logistic regression and content analysis. Of the 342 McMAP assessments, 104 (30.4%) of them were missing a task comment and 93 (27.2%) were missing a global comment. This surprisingly high degree of missing comments is most interesting because the system’s current electronic format requires assessors to provide narrative comments (to expand on their numerical scores) before they can proceed to the next portion of the assessment. Nevertheless, assessors found ways to avoid providing narrative comments.
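The counts reported above can be checked with a few lines of arithmetic. The variable and function names are illustrative; the numbers are those given in the text.

```python
total_assessments = 342
pgy1_assessments = 158
pgy2_assessments = 184
missing_task_comments = 104
missing_global_comments = 93

def percent(part, whole):
    """Share of `whole` represented by `part`, as a percentage to one decimal."""
    return round(100.0 * part / whole, 1)

# The PGY-1 and PGY-2 counts should add up to the full sample.
counts_are_consistent = pgy1_assessments + pgy2_assessments == total_assessments

task_missing_pct = percent(missing_task_comments, total_assessments)      # 30.4
global_missing_pct = percent(missing_global_comments, total_assessments)  # 27.2
```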
After examining the underlying assumptions, we performed a logistic regression analysis to examine the outcome (i.e., the proportion or percentage of items marked “done, but needs attention” in the McMAP checklist). We constructed both logistic regression models from the 342 McMAP assessments, removing the entire assessment from the analysis whenever a task or global rating was missing. Of the 321 complete ratings, 62 were marked “done, but needs attention.” A test of the logistic regression model examining the use of the “done, but needs attention” checklist option was statistically significant: R² = 0.31, χ²(5, N = 321) = 68.27, P < .001. This finding suggests that some of the five variables are associated with whether assessors select the “done, but needs attention” checklist option. Table 1 shows the regression values for each individual variable.
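The reported P value can be reproduced from the test statistic alone. The sketch below uses a standard closed-form expression for the upper tail of a chi-square distribution with odd degrees of freedom, so that only Python's standard library is needed. The function name is hypothetical, and the original analysis was run in R.

```python
import math

def chi2_upper_tail_odd_df(x, df):
    """P(X >= x) for a chi-square variable with odd degrees of freedom.

    Uses the identity
      Q(x | df) = erfc(sqrt(x / 2))
                  + sqrt(2 / pi) * exp(-x / 2) * sum_{r=1..(df-1)/2} t_r,
    where t_r = sqrt(x)^(2r - 1) / (1 * 3 * ... * (2r - 1)).
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2.0))  # the df = 1 case
    term = root                              # t_1
    series = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)              # t_r -> t_(r+1)
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-x / 2.0) * series

# Statistic and degrees of freedom as reported for the main model.
p_value = chi2_upper_tail_odd_df(68.27, 5)
```

The resulting value is many orders of magnitude below .001, consistent with the reported P < .001.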
Our expert’s (T.M.C.’s) ratings of strengths and weaknesses within individual comments, using the CCERR,20 appeared to be, according to the Wald criterion, associated with assessors’ selection of the “done, but needs attention” checklist option (see Table 1). Additionally, the most reliable variable associated with use of the “done, but needs attention” option was the task rating provided by the faculty member at the time of assessment. Neither the length of the task comment nor faculty assessors’ global ratings were statistically significant variables according to this model. These results support the interpretation that when faculty assessors noted and commented about weaknesses in a resident’s performance, they were also more likely to select the “done, but needs attention” checklist option. Conversely, when faculty assessors provided more specific strengths in their comments, they were less likely to use the “done, but needs attention” option. Finally, we observed a relationship between faculty assessors’ task rating and their use of the “done, but needs attention” checklist option (i.e., the higher the task rating, the less they selected this option).
To substantiate the results of the first model, we analyzed an ordinal logistic regression model that examined the relationships among the “done, but needs attention” checklist option, residents’ global rating scores, and global written comments. The dependent measure for this analysis was the expert’s (T.M.C.’s) rating (using the modified CCERR20) of balance within global comments: more specifically, the extent to which global comments provided both strengths and areas of improvement. We created this second model to test relationships between the checklist options and written comments (based on whether they contained both strengths and areas for improvement). This model was statistically significant: R² = 0.041, χ²(2, N = 321) = 10.67, P = .005. The only significant variable, however, in this model was use of the “done, but needs attention” checklist option (b = 0.045; P = .001), which suggests that when assessors select the “done, but needs attention” option, the associated comments contain both strengths and areas for improvement. That is, when faculty selected the “done, but needs attention” option, they tended to defend or explain their judgments by elaborating on what the residents did well (strengths) and what tasks require further care and consideration (areas of improvement).
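Two of the figures above can be worked through by hand. With 2 degrees of freedom, the chi-square upper tail reduces to exp(-x/2), which reproduces the reported P value. The coefficient can also be converted to an odds ratio, under the assumption (not stated in the text) that b is a log-odds coefficient per one-percentage-point increase in the proportion of items marked "done, but needs attention". The function names are illustrative.

```python
import math

def chi2_upper_tail_df2(x):
    """P(X >= x) for a chi-square variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)

def odds_ratio(b, delta=1.0):
    """Odds ratio implied by a log-odds coefficient b for a change of delta."""
    return math.exp(b * delta)

# Values as reported for the ordinal model.
model_p_value = chi2_upper_tail_df2(10.67)   # about 0.0048, i.e. .005
or_per_point = odds_ratio(0.045)             # about 1.046
or_per_ten_points = odds_ratio(0.045, 10.0)  # about 1.57
```

Read this way, each additional percentage point of flagged checklist items would raise the odds of a more balanced global comment by roughly 4.6%. This interpretation depends on the scaling assumption above and on the sign convention of the fitting software.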
We completed an analysis of task and global McMAP comments to explore whether assessors communicate using a hidden code.10 Written comments associated with the “done, but needs attention” checklist option read quite differently than those accompanying the “done” option. Global comments associated with the selection of “done, but needs attention” often reiterated comments focused on the task. Assessors used the word “good” most frequently in their “done, but needs attention” comments and, in most instances, followed up with an explanation that noted a deficit.
Good assessment note. [Reassessment] notes need to be expanded and documented.
Good knowledge. Hardworking, good management plans. Work on charting legible notes.
Organized and good plans. Still need to focus on some presentations.
Overall good shift. Saw lots of patients, good with [ultrasound]. Continue to improve your clinical exposure by seeing as many patients as possible and accepting feedback.
Selection of the “done, but needs attention” option seems to help establish a framework (i.e., a statement of strength and an area of improvement) for assessors to use when providing written comments. Words commonly appearing in “done, but needs attention” comments included the following: needs, consider, more, remember, follow-up, good, and hardworking or works hard.
In contrast, written comments associated with selection of the “done” checklist option carried a different connotation than those associated with the “done, but needs attention” option. These comments were longer and often lacked specific directions for improvement. The comments corresponding with “done” tended to focus mainly on positive aspects that reinforced the checklist option. Assessors selecting the “done” option used writing conventions (e.g., punctuation marks or capital letters) to provide emphasis, and they tended to focus on the resident in relation to others, suggesting more norm than criterion referencing.
On own initiative made several attempts to contact family of unstable patient. Good foresight!
Excellent shift—saw many complex patients. Ready to be a senior resident. Managed a difficult airway efficiently and impressively. Also managed a septic patient very well. Inserted a central line extremely well and appropriately managed a patient with multiple system failure. Impressive work!
Excellent consult skills. Organized, thorough, and reliable. Spoke with multiple consult and obtained approach clinically and relevant advice. EXCELLENT.
Beyond her level of functioning in the department. Excellent charting/assessments and reassessments. Great management/discharge/plans. No concerns. Great job! Very efficient for her level—trying to instruct medical student as well.
Functions extremely independently and well for PGY-1. Probably more like a PGY-2. Concise histories, good plans and treatment and disposition plans.
Written comments that elaborated on “done” often offered some observation or assessment, but the suggestions were stated more passively than those associated with “done, but needs attention.” These comments also frequently referenced some global aspect of the resident’s performance in a way that could help the resident achieve a higher level.
Could improve on assertiveness and confidence as he is a bit quiet.
Good job with DKA [diabetic ketoacidosis] presentation. Feel free to provide clinical significance for lab findings.
Proficient. Efficient. Reliable. Disposition plans earlier in management if possible.
Clear explanations. Ensured plan understood. Offered opportunity for questions. In future, give results of test/investigations before transitioning into discharge plans.
In their comments corresponding with “done,” many assessors appeared to focus on a resident’s current level of performance and the “next steps” he or she could achieve to move onward. In several instances, faculty assessors indicated that the resident was performing well by writing “no concerns” in the comment fields.
Our purpose was to examine the relationships that exist among checklist scores, task ratings, global ratings, and written comments to better understand how these assessment tools can be used to assess learning and facilitate feedback encounters. McMAP assessments function as a programmatic approach for evaluating medical competence in the workplace. McMAP assessments contain both scores (quantitative data) and narrative comments (qualitative data), which together optimize the way in which assessors can communicate information about residents.
Results from this study indicate that even when assessors are required to provide written comments, many of them (approximately 30%) find a way to avoid supplying such information. Despite the opportunities provided and the wide acceptance and appreciation of narrative comments, some assessors were still unwilling to supplement their scores with written comments. For nearly a decade, the medical education community has known some of the reasons assessors are not universally willing to provide narrative comments, especially in situations where the comments may be negative14; however, despite the community’s best efforts to make assessments more accessible, some assessors still resist providing narrative comments.
Results from the logistic regression analyses highlight the relationships among checklist scores, task ratings, global ratings, and narrative comments. In the main logistic regression model, we noted that, interestingly, task ratings were significantly associated with the selection of the “done, but needs attention” checklist option, but the global rating was not. These relationships may have occurred because a task rating captures performance on a specific task or with specific content, while a global rating is designed to provide a holistic view of the trainee. The medical education literature indicates that task sampling affects the reliability of scores as well as increases variability21; therefore, by allowing assessors the opportunity to provide both a task rating and a global rating, more specific information related to the resident’s competence for a particular task can be obtained.
Given that both of the logistic regression models indicated that the selection of the “done, but needs attention” checklist option was associated with balanced and formative comments, we believe that selection of the “done, but needs attention” option is a means of identifying residents who are developing and exhibiting noteworthy strengths and weaknesses throughout a single encounter or shift. The observation that assessors seemed to provide more balanced comments and/or more comments suggesting areas of improvement may be due to their use of these comments to explain or justify their selection of the “done, but needs attention” item.22 Another possible reason for this relationship may be that the comment itself indicated the task was not fully complete or that it was not completed well and, therefore, the checklist score also needed to reflect this.23
We argue that investigating relationships between quantitative and qualitative assessment data is necessary to determine how different assessments can be best used to provide a robust and authentic depiction of a resident’s performance in the workplace. Our findings suggest that assessors seem to provide balanced comments only when they identify a deficiency in a resident’s performance; there are several possible explanations for this. First, pinpointing specific areas of improvement in a strong or satisfactory individual is often difficult because high-performing individuals tend to do well regardless of whether an assessor gives them written comments. Second, postgraduate residency programs represent the last level of formalized training before unsupervised practice, and sometimes feedback from assessors relates to their own preferences24 rather than errors or deficiencies in a resident’s performance that would keep him or her from moving on to independent practice. Finally, assessors are often not trained on how to provide written comments or engage in feedback encounters with high-performing individuals. As shown with some of our qualitative findings, some faculty members assess residents in relation to specific benchmarks (e.g., stage of training): “Probably more like a PGY-2” or “Ready to be a senior resident.” Further, in the postgraduate training environment, some assessors may perceive residents as colleagues, which makes commenting or offering formative feedback harder, especially if feedback is not part of the established culture.
This study has demonstrated that assessors are more likely to provide specifics about strengths and areas of improvement for weaker performances. Faculty developers need to make clinical competence committee members aware of the association between ratings and narrative comments and provide support for how to improve narrative comments for residents at all levels. Perhaps, for assessments in postgraduate training, “done, but needs attention” and “not done” should be the only two checklist options, thus freeing assessors from the notion that a written comment implies a glaring deficiency.
Our study is one of the first to examine associations among checklist scores, task ratings, global ratings, and narrative comments. More studies that explore relationships among various assessment methods are needed because we limited ours to checklist scores, task and global ratings, and specific aspects of narrative comments. In the current study, only one investigator assessed narrative comments using the modified CCERR.20 Our investigator (T.M.C.) has expertise in both education and emergency medicine; however, we cannot make any claims about how our findings would generalize across other postgraduate specialties. Finally, examining relationships across only two of the CanMEDS roles (i.e., communicator and collaborator) leaves some unanswered questions about the relationships that may exist within other CanMEDS roles.
Implications for future research and practice
The shift toward competency-based medical education, the push toward programmatic assessment, and the desire to assess trainees using the CanMEDS framework have inspired leaders of many medical institutions to examine and change their existing assessment programs. This study presents a starting point for program leaders to examine how different assessments fit together and to think about modifying their assessment systems. Exploring associations among various pieces of assessment data can lead to more carefully planned assessment programs that have the potential to encourage feedback dialogues and support decision making. Continuing to explore relationships among various assessment data may also stimulate faculty to think about the purpose and utility of various assessments. Understanding how various assessment data align is a precursor to examining broader associations among the CanMEDS roles. If medical educators can glean information about relationships among the various CanMEDS roles, then developing assessment programs that can assess a “competent physician”19 is also possible.
This study highlights some of the relationships that exist among checklist scores, task ratings, global ratings, and written comments. Our findings highlight that task ratings and global ratings are differentially associated with checklist options. Furthermore, analysis of written comments supports the notion of a “hidden code” used to communicate assessors’ evaluation of medical competence. This study has implications for how individuals should be interpreting information obtained from qualitative assessments and what effects successful achievement may have on the balance and utility of written comments.
1. Schuwirth LW, Van der Vleuten CP. Programmatic assessment: From assessment of learning to assessment for learning. Med Teach. 2011;33:478–485.
2. Schuwirth LWT, van der Vleuten CPM. Assessing competence: Extending the approaches of reliability. In: Hodges BD, Lingard L, eds. The Question of Competence: Reconsidering Medical Education in the Twenty-First Century. Ithaca, NY: Cornell University Press; 2012:113–130.
3. Ginsburg S, McIlroy J, Oulanova O, Eva K, Regehr G. Toward authentic clinical evaluation: Pitfalls in the pursuit of competency. Acad Med. 2010;85:780–786.
4. Govaerts M, van der Vleuten CP. Validity in work-based assessment: Expanding our horizons. Med Educ. 2013;47:1164–1174.
5. Green ML, Holmboe E. Perspective: The ACGME toolbox: Half empty or half full? Acad Med. 2010;85:787–790.
6. Gingerich A, Regehr G, Eva KW. Rater-based assessments as social judgments: Rethinking the etiology of rater errors. Acad Med. 2011;86(10 suppl):S1–S7.
7. Plymale MA, Donnelly MB, Lawton J, Pulito AR, Mentzer RM. Faculty evaluation of surgery clerkship students: Important components of written comments. Acad Med. 2002;77(10 suppl):S45–S47.
8. Ginsburg S, Eva K, Regehr G. Do in-training evaluation reports deserve their bad reputations? A study of the reliability and predictive ability of ITER scores and narrative comments. Acad Med. 2013;88:1539–1544.
9. Ginsburg S, Gold W, Cavalcanti RB, Kurabi B, McDonald-Blumer H. Competencies “plus”: The nature of written comments on internal medicine residents’ evaluation forms. Acad Med. 2011;86(10 suppl):S30–S34.
10. Ginsburg S, Regehr G, Lingard L, Eva KW. Reading between the lines: Faculty interpretations of narrative evaluation comments. Med Educ. 2015;49:296–306.
11. Bismil R, Dudek NL, Wood TJ. In-training evaluations: Developing an automated screening tool to measure report quality. Med Educ. 2014;48:724–732.
12. Canavan C, Holtman MC, Richmond M, Katsufrakis PJ. The quality of written comments on professional behaviors in a developmental multisource feedback program. Acad Med. 2010;85(10 suppl):S106–S109.
13. Dudek NL, Marks MB, Wood TJ, et al. Quality evaluation reports: Can a faculty development program make a difference? Med Teach. 2012;34:e725–e731.
14. Dudek NL, Marks MB, Regehr G. Failure to fail: The perspectives of clinical supervisors. Acad Med. 2005;80(10 suppl):S84–S87.
15. van der Leeuw RM, Overeem K, Arah OA, Heineman MJ, Lombarts KM. Frequency and determinants of residents’ narrative feedback on the teaching performance of faculty: Narratives in numbers. Acad Med. 2013;88:1324–1331.
16. Ginsburg S, van der Vleuten C, Eva KW, Lingard L. Hedging to save face: A linguistic analysis of written comments on in-training evaluation reports. Adv Health Sci Educ Theory Pract. 2016;21:175–188.
17. Ten Cate O. The false dichotomy of quality and quantity in the discourse around assessment in competency-based education. Adv Health Sci Educ Theory Pract. 2015;20:835–838.
18. Chan T, Sherbino J; McMAP Collaborators. The McMaster Modular Assessment Program (McMAP): A theoretically grounded work-based assessment system for an emergency medicine residency program. Acad Med. 2015;90:900–905.
19. Frank JR, Snell L, Sherbino J. CanMEDS 2015 Physician Competency Framework. Ottawa, Ontario, Canada: Royal College of Physicians and Surgeons of Canada; 2015. http://canmeds.royalcollege.ca/uploads/en/framework/CanMEDS%202015%20Framework_EN_Reduced.pdf. Accessed March 16, 2017.
20. Dudek NL, Marks MB, Wood TJ, Lee AC. Assessing the quality of supervisors’ completed clinical evaluation reports. Med Educ. 2008;42:816–822.
21. Swanson DB, Clauser BE, Case SM. Clinical skills assessment with standardized patients in high-stakes tests: A framework for thinking about score precision, equating, and security. Adv Health Sci Educ Theory Pract. 1999;4:67–106.
22. Sibbald M, de Bruin AB, Yu E, van Merrienboer JJ. Why verifying diagnostic decisions with a checklist can help: Insights from eye tracking. Adv Health Sci Educ Theory Pract. 2015;20:1053–1060.
23. Sebok SS. Understanding How Raters Communicate in the Context of Medical High-Stakes, Performance-Based Assessments [PhD dissertation]. Kingston, Ontario, Canada: Queen’s University; 2015. https://qspace.library.queensu.ca/handle/1974/13713. Accessed March 24, 2017.
24. Apramian T, Cristancho S, Watling C, Ott M, Lingard L. Thresholds of principle and preference: Exploring procedural variation in postgraduate surgical education. Acad Med. 2015;90(11 suppl):S70–S76.