Assessing How Well Three Evaluation Methods Detect Deficiencies in Medical Students' Professionalism in Two Settings of an Internal Medicine Clerkship

Hemmer, Paul A. MD; Hawkins, Richard MD; Jackson, Jeffrey L. MD; Pangaro, Louis N. MD



Undergraduate medical educators have an educational and societal obligation to ensure that their graduates possess the attributes of professionalism requisite for practicing medicine. 1,2 During clinical clerkships, clerkship directors rely on housestaff and faculty not only to model and teach appropriate attitudes and behaviors but also to evaluate the students' professionalism. 3,4 Instructors' descriptive evaluations represent the primary means for assessing professionalism, and clerkship directors place great emphasis on these evaluations. 5 However, little has been written about the effectiveness of evaluation methods during clerkships in identifying students with deficiencies in professionalism. 6,7 In addition, the shift toward ambulatory care education 8 has raised concern over the quality of the educational experience for students, 9 but no study has examined the evaluation of professionalism among students in this setting.

We previously demonstrated the predictive validity of our clerkship-evaluation process for identifying marginally performing students. 10,11 In the present study, we expanded our inquiry by comparing the detection of professionalism deficiencies using three evaluation methods—standard checklists, written comments, and comments from formal evaluation sessions—in ambulatory care and ward settings of an internal medicine clerkship.


The third-year internal medicine clerkship at The Uniformed Services University of the Health Sciences (USUHS) is a 12-week clerkship in which students rotate at two of six geographically separated hospitals. Since 1994, the clerkship has consisted of a six-week ambulatory care rotation and a six-week inpatient ward rotation. 12 During ward rotations, each student's instructors include one or two interns, one resident, an attending physician, and a preceptor (a staff physician who works with the same three to five students for six weeks as a small-group tutor). While there is some variability in the structures of the ambulatory care rotations at different sites, students work with a preceptor and typically four to six core general and/or subspecialty clinic attending physicians. Each student works with each core clinic attending physician for an average of four to five half-day clinics (range, three to ten).

During the clerkship, all instructors complete a standard evaluation form that consists of a checklist and a space for written comments. The checklist uses a five-point rating scale with overall ratings of “outstanding,” “above average,” “acceptable,” “needs improvement,” and “unacceptable.” Within each of the 15 rated performance categories (which cover the breadth of students' knowledge, skills, and professionalism), written, behaviorally-based descriptors anchor each level of a student's performance. In addition, all instructors participate in formal evaluation sessions, 13 which take place every three to four weeks during the clerkship at all clerkship sites. An onsite clerkship coordinator facilitates these sessions and also makes notes of the instructors' comments. The onsite clerkship director provides private and specific feedback to each student the following day. In our clerkship, an instructor's descriptive evaluation of a student's performance is based on the student's progress from being a “reporter” to being an “interpreter” to being a “manager/educator.” 14 Mastery at each level requires the student to possess and demonstrate competency across the domains of knowledge, skills, and professionalism. Following the clerkship, the Department of Medicine Education Committee (DOMEC) 11 reviews the performance of any student who has not met curricular requirements. These include students who do not pass the National Board of Medical Examiners subject examination in medicine at the end of the clerkship and those who are identified by instructors' comments. The DOMEC decides each reviewed student's final clerkship grade and, if indicated, the level of required remediation.

In mid-1998, two reviewers (PH, RH) independently assessed, for deficiencies in professionalism (attitude and/or demeanor), the clerkship records of all students (n = 36) whom the USUHS DOMEC had required to perform some level of remediation following their core third-year medicine clerkships during the three academic years from 1994 to 1997. Eighteen students received unsatisfactory clerkship grades (C-, D, or Fail) and were required to remediate due to deficiencies in professionalism (remediation involved repeating part or all of the third-year medicine clerkship and/or completing an internal medicine subinternship during the fourth year of medical school). These 18 students represent 3% of all students from the reviewed academic years.

Using grounded-theory qualitative methods, we (PH, RH) abstracted the records of these 18 students. Specifically, we reviewed and coded the written record of each student's entire 12-week clerkship performance, consisting of the ratings and written comments on the evaluation forms from instructors and the notes taken at the formal evaluation sessions during the clerkship. For each evaluation method, we used the final evaluations from all instructors who worked with the student during both the ambulatory care and ward rotations of the clerkship. From the checklists, we recorded each instructor's rating, on the five-point scale, from each of the six domains of professionalism listed on the form: reliability and commitment; response to instruction; self-directed learning; patient interactions; response to stress; working relationships.

Through review of the evaluation forms and evaluation session notes, we (PH, RH) recorded those written comments and formal evaluation-session comments made by instructors that pertained to a student's professionalism. Using the evaluation form checklist's overall ratings (e.g., “outstanding”) and behaviorally-based descriptors as references, we coded the comments by consensus into one of the evaluation form's six professionalism domains and then rated each comment on a scale of 1–5. If an instructor made several comments concerning one domain, we recorded the lowest-rated comment. If there was no comment about professionalism, we construed this as satisfactory performance. As a result of this process, we identified the number of distinct written and evaluation session comments concerning professionalism made by instructors about each student.

When an on-site clerkship director is also an instructor, he or she makes comments about each student at the formal evaluation sessions but, due to time constraints, does not record them. Thus, in recording and analyzing the evaluation-session comments for four students on the ambulatory care rotation, we assumed that at the evaluation sessions the on-site directors would have identified at least the same number of domains and made at least the same number of spoken comments as in their written comments.

For each evaluation method, we calculated a detection index (DI), which is the percentage of “needs improvement” or “unacceptable” ratings by all instructors across the six professionalism domains. (For example, if six instructors using a given evaluation method each rated a student across the six domains of professionalism, there would be a total of 36 professionalism ratings. If out of those ratings, nine were less than acceptable, the DI for that student using that method would be 9 ÷ 36, or 25%.) Continuous variables were analyzed using Student t tests; categorical variables were analyzed using chi-square analysis or Fisher's exact test, as appropriate.
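The detection-index arithmetic in the parenthetical example above can be sketched as follows. The ratings below are hypothetical, constructed only to reproduce the worked example of nine less-than-acceptable ratings out of 36:

```python
def detection_index(ratings):
    """Detection index (DI): percentage of less-than-acceptable ratings.

    `ratings` is a list of per-instructor rating lists, one rating
    (1-5 scale) per professionalism domain. Ratings of 1 or 2
    ("unacceptable" / "needs improvement") count as deficiencies.
    """
    all_ratings = [r for instructor in ratings for r in instructor]
    deficient = sum(1 for r in all_ratings if r <= 2)
    return 100.0 * deficient / len(all_ratings)

# Worked example from the text: six instructors x six domains = 36 ratings,
# nine of them less than acceptable -> DI = 9 / 36 = 25%.
ratings = [[3, 3, 2, 3, 2, 3],
           [2, 3, 3, 2, 3, 3],
           [3, 2, 3, 3, 3, 3],
           [3, 3, 3, 2, 2, 3],
           [3, 3, 3, 3, 3, 2],
           [3, 3, 2, 3, 3, 3]]
print(detection_index(ratings))  # → 25.0
```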


The clerkships of 15 of the 18 students reviewed consisted of both ambulatory care and ward rotations. Three students' clerkships were 12 weeks of inpatient ward medicine only (because of scheduling limitations, the ambulatory care rotation was not available for those three students). The evaluation-form completion rates were identical on ambulatory care and ward rotations, at 93%. Attendance at the formal evaluation sessions was significantly higher on ward rotations than on ambulatory care ones (89% versus 57%, p = .001, chi-square analysis). The mean number of evaluators per student was also greater during ward than during ambulatory care rotations (7.7 versus 6.3, p = .003, two-sample t test with equal variances). In both the ambulatory care rotation and the ward rotation, a significantly greater percentage of instructors completed the evaluation form than attended the evaluation session (ambulatory care, 93% versus 57%; ward, 93% versus 89%, p < .05, chi-square analysis).

The most commonly cited domains for deficiencies in professionalism were similar in ambulatory care and ward rotations: reliability and commitment (RC), response to instruction (RI), and working relationships (WR). Examples of the deficiencies noted include, for RC, failure to follow up on daily patient care issues and failure to comply with clerkship requirements for written work; for RI, inadequate response to repeated feedback on a core issue; and for WR, being argumentative or immature.

For each evaluation method, the detection index from the ward rotation was significantly higher than that from the ambulatory care rotation (Figure 1). For all three evaluation methods, instructors were twice as likely to identify deficiencies in professionalism for students on the ward as on the ambulatory care rotation (evaluation form checklist: odds ratio [OR] 2.2; 95% confidence interval [CI], 1.5–3.4—written comments: OR 2.2; 95% CI, 1.5–3.4—evaluation session: OR 1.9; 95% CI, 1.3–2.9). Of the 18 students who were required to remediate, all had been identified during their ward rotations. Of the 15 students who performed both ambulatory care and ward rotations, four students on the ambulatory care rotation were not identified as having deficiencies in professionalism by any evaluation method (100% versus 73% detected; p = .03). Two additional students on the ambulatory care rotation each received only a single low checklist rating (from an attending physician not present at the evaluation session) or a single, vague written comment.
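As a rough sketch of how an odds ratio and a Woolf-type (log-scale) confidence interval of the kind reported above are computed from a 2 × 2 table of rating counts; the counts below are hypothetical, chosen only to land near the reported OR of 2.2 (95% CI, 1.5–3.4), since the paper reports the results rather than the underlying table:

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a 95% Woolf (log-scale) confidence interval for a
    2x2 table: a/b = deficient / non-deficient ratings on the ward,
    c/d = the same counts on the ambulatory care rotation."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of ln(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts for illustration only:
print(odds_ratio_ci(90, 300, 40, 290))
```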

Figure 1: Detection index (DI) scores for ambulatory care and ward rotations grouped by three evaluation methods applied to 18 students on those rotations from 1994 to 1997 at The Uniformed Services University of the Health Sciences. For each evaluation method, the DI score is the percentage of all instructors' less-than-acceptable ratings of a student across the six professionalism domains (e.g., “reliability and commitment”). The figure shows that for each evaluation method, the DI from the ward rotations was significantly higher than that from the ambulatory care rotations (p < .001 for the checklist and written approaches, p < .01 for the evaluation session approach). This means that instructors were twice as likely to describe deficiencies in professionalism during the ward rotation.

Figure 2 uses the same data as shown in Figure 1 but groups the detection indices for each evaluation method by the inpatient ward and ambulatory care rotations. In each setting, instructors identified deficiencies in significantly more professionalism domains by their comments at the formal evaluation sessions than by using either the evaluation-form checklist (ambulatory care: OR 1.9; 95% CI, 1.1–3.1; p < .02—ward: OR 1.7; 95% CI, 1.3–2.2; p < .01) or their written comments (ambulatory care: OR 1.9; 95% CI, 1.1–3.1; p < .02—ward: OR 1.6; 95% CI, 1.2–2.1; p < .01). In contrast, there was no difference in the likelihoods of a deficiency notation between written checklists and written comments in either the ambulatory care or inpatient setting (ambulatory care: OR 1.0; 95% CI, 0.6–1.7; p = .89—ward: OR 1.0; 95% CI, 0.8–1.4; p = .77). For written and evaluation-session comments, a greater percentage of the instructors completing the specific evaluation method identified deficiencies in professionalism for students on the ward rotations than made such identifications on the ambulatory care rotations (Table 1). The mean numbers of comments per instructor and per professionalism domain were similar on both types of rotations (Table 1). Of note, nearly half of the written and evaluation-session comments on the ambulatory care rotation were made on only two students.

Figure 2: Detection index (DI) scores from three evaluation methods grouped by the ambulatory care and ward rotations in which they were applied to 18 students in 1994–1997, The Uniformed Services University of the Health Sciences. For each evaluation method, the DI score is the percentage of all instructors' less-than-acceptable ratings of a student across the six professionalism domains (e.g., “reliability and commitment”). The figure shows that in each type of rotation, instructors identified deficiencies in significantly more professionalism domains by using the evaluation-session comments than by using the other two evaluation methods (p < .02 for the ambulatory care rotations, p < .01 for the ward rotations), and that there was virtually no difference between the likelihoods of describing a deficiency by means of the evaluation-form checklist and by written comments in either setting.
Table 1: Data on Sources and Numbers of Comments from Two Evaluation Methods Used to Identify Deficiencies in Professionalism for 18 Students in the Two Rotations of the Internal Medicine Clerkship, USUHS, 1994–1997*

Regarding comments on professionalism, 72 instructors from the ambulatory care and ward rotations made both written and evaluation-session comments that indicated deficiencies in professionalism. An additional 15 instructors made written comments only: six (40%) did not attend the evaluation session, and nine (60%) attended but did not identify deficiencies in their evaluation-session comments. Furthermore, 36 instructors made evaluation-session comments only: eight (25%) did not complete an evaluation form, and 28 (75%) completed a form but did not identify deficiencies in professionalism in their written comments. Of these 28 instructors, 22 also did not identify professionally deficient students with the checklist.


Recent methods for evaluating professionalism in undergraduate medical education have focused on the use of specific evaluation forms for identifying and subsequently investigating deficiencies in professionalism. 6,7 While these methods enhance “specificity” through detailed documentation and, if necessary, academic action, their ability to enhance the “sensitivity,” or likelihood of identifying deficiencies in professionalism, is unproven. To date, there has not been a study comparing evaluation methods or clinical settings for the identification of deficiencies in professionalism.

In our present study, instructors were twice as likely to identify students with deficiencies in professionalism during the ward rotation as during the ambulatory care rotation of our third-year internal medicine clerkship. All 18 deficient students were identified during their ward rotations. In contrast, six of the 15 (40%) students with deficiencies in professionalism who had an ambulatory care component to their clerkships either were not identified or were identified only by a single notation by one instructor. In both ambulatory care and ward settings, formal evaluation sessions yielded the highest detection rate of all three evaluation methods. This increase was particularly striking in the ambulatory care rotation, given that significantly fewer instructors attended the formal evaluation sessions than completed the evaluation forms.

There may be several reasons for the enhanced identification of these students in the inpatient portion of the clerkship. First, the ward setting is a more stressful working environment, with overnight call responsibilities, a variety of educational and clinical demands, and the responsibility for caring for complicated patients who are acutely ill. Also, there are very close working relationships with the team members. Consequently, a number of evaluators are able to view the student's performance in a variety of circumstances and on a daily basis for four to six weeks. On the other hand, in the ambulatory care component of our particular clerkship, students work in several clinics and have a given instructor only once or twice weekly. This discontinuity of observation may impair an instructor's ability to identify students with subtle or less obvious problems. In fact, nearly half of the comments made by the ambulatory care instructors were made on only two students who had serious difficulties. Alternatively, it is entirely possible that students behave differently when they spend the majority of their time with attending physicians.

We believe there are also several reasons why formal evaluation sessions yielded higher deficiency detection than did written checklists or comments. First, individual instructors' observations can be corroborated by those of others, and previously discounted actions of students may then assume added significance. While one might be concerned that instructors' observations could be biased by comments made in a group-evaluation setting, the on-site clerkship directors are trained to be watchful for such an occurrence. In fact, our findings would suggest that such bias does not occur, because, as noted in Table 1, there was not complete agreement among evaluators in either written or verbal comments about deficiencies in a student's professionalism. However, having more instructors relate their observations and comments, as was achieved with the evaluation sessions, is an essential part of substantiating academic decisions.

Second, instructors may be more willing to discuss those areas of concern that they hesitate to write down. The finding that 28 instructors—nearly a fourth of those making some type of identifying comment—made comments only during evaluation sessions underscores this point.

Third, formal evaluation sessions provide an ongoing forum for “case-based,” “real-time” faculty development, thereby improving goal-based evaluation and teachers' confidence. Since formal evaluations occur several times during the clerkship, we can ensure that instructors understand students' behaviors, reinforce the need for modeling and teaching professionalism, emphasize that professionalism is a core component of competence, and develop a plan for addressing any deficiencies with the student. These benefits may explain why, although the mean numbers of written and evaluation-session comments per evaluator or per cited domain were similar (Table 1), the evaluation-session comments came from more instructors, applied across significantly more domains of professionalism, and thus yielded more detailed descriptions of the students' deficiencies.

There are several limitations to our study. First, we reviewed only students whose deficiencies in professionalism were of such magnitude that, in the estimation of the DOMEC, the students required remediation. Other students identified during the clerkship as having deficiencies but responding appropriately to feedback are not reviewed by the DOMEC. Hence, our pool represented a particular spectrum of students with clear professionalism problems. Further qualitative studies to assess whether our findings hold for identifying students with less extreme problems are warranted.

Second, we assumed that students identified during the inpatient component of the clerkship were “true positives”—that they had deficiencies in professionalism that should have been detected during the ambulatory care component. We feel this assumption is justified, given the previously established predictive validity of our clerkship evaluation process. 10,11

Third, the detection index used in this study looks at only the percentage of professionalism domains in which students' behaviors were rated less than acceptable. Further analysis would be helpful to determine qualitative similarities or differences between written and evaluation-session comments.

Fourth, the professionalism domains we studied were limited to the checklist items on our clerkship-evaluation form. However, we feel these domains sample the breadth of professional behavior and are similar to those cited by other organizations. 2,15

Fifth, the data abstractors (PH, RH) were not blinded to the instructors' checklist ratings or the students' final clerkship grades. Nonetheless, independently categorizing and rating written and evaluation-session comments, and then arriving at consensus, reduced potential bias.

Sixth, it is not clear whether instructors who made only evaluation-session comments did not make written comments because they had made evaluation-session comments or because they were unwilling to make written comments. Regardless, it seems unlikely that these comments would have been captured had it not been for the evaluation sessions.

Seventh, we assumed that when onsite directors served as instructors, they identified at least the same number of domains and made as many verbal as written comments. In fact, this is likely to be an underestimation, since our experience with these evaluation sessions is that overall verbal comments exceed written ones both in quality 16 and in quantity.

Finally, the difference between the ward and the ambulatory care evaluation session DIs may, in part, be due to the lower attendance rate at the ambulatory care evaluation sessions. Barriers to attendance might include conflicts with scheduled clinical duties, lack of adequate notification, or scheduled days off. Based on the findings in this and previous studies, 10,11 we believe that improving attendance at the ambulatory care evaluation sessions can improve the detection of unprofessional behaviors in that aspect of the clerkship.

Despite these limitations, we believe the educational implications are clear: formal evaluation sessions enhance the identification of marginally performing students during clinical clerkships. 10,11,13,16

Beyond improving evaluation and providing faculty development, the evaluation sessions have an additional advantage in allowing feedback, intervention, and continued observation while the student is still on the clerkship—essential elements when making decisions for academic action. 17

In contrast, our detailed evaluation-form checklist had low detection indices in both clinical settings. Instructors may feel limited by the form's rating scale, they may restrict their use of the full scale, or their observations may not be adequately captured on the checklist, even one with behavior-based, written descriptors anchoring each level of a student's performance. Thus, while rating scales on evaluation forms can be an important means of communicating clerkship goals, relying on rating scales alone may be insufficient to identify at-risk students; instructors must have an opportunity to make descriptive comments. 18,19

Prior studies examining unprofessional behaviors among medical students used specific professionalism rating forms, which were felt to improve either the identification or the documentation of unprofessional behaviors—in part through the definition of traits to be evaluated, although these forms were not compared with other methods of evaluation. 6,7 Further limitations of these described methods include the apparent lack of direct conversations with instructors during the clerkship, basing the evaluation process on instructors' use of a summative, or final, clerkship-evaluation-form rating scale, delays in completing the evaluation form, and the fact that formal feedback to students did not occur until at least one such form had been received and reviewed, which could be after the student had left the clerkship. In contrast, it appears from this and prior studies 10,11 that even with lower attendance rates than evaluation-form completion rates, the investment in formal evaluation sessions is worthwhile because of their better sensitivity, their timeliness, the likelihood of identifying deficiencies in competence, and the generation of early, or formative, feedback to students.

An additional implication arises from our surprise at the low identification of professionalism deficiencies in the ambulatory care setting, especially since many of the ambulatory care attending physicians at our main teaching hospitals also had inpatient clerkship and residency teaching experience. Efforts at improving physicians' attendance at the ambulatory care evaluation sessions, such as avoiding conflicts with scheduled clinical duties, may be essential to enhancing the identification of professionalism deficiencies in this setting. Nevertheless, we believe our present findings reinforce a critical role for inpatient rotations in student evaluation and, until professionalism is studied in alternatively structured ambulatory care rotations, should sound a cautionary note for those clinical clerkships conducted entirely in ambulatory care settings and/or with less trained or experienced faculty.


1. Cohen JJ. Leadership for medicine's promising future. Acad Med. 1998;73:132–7.
2. MSOP Writing Group. Learning objectives for medical student education—guidelines for medical schools: report I of the Medical School Objectives Project. Acad Med. 1999; 74:13–8.
3. Mufson MA. Professionalism in medicine: the department chair's perspective on medical students and residents. Am J Med. 1997;103:57–9.
4. Stern DT. Practicing what we preach? An analysis of the curriculum of values in medical education. Am J Med. 1998;104:569–75.
5. Clerkship Directors in Internal Medicine, Evaluation Task Force Survey Results, 1996. <>, accessed 11/4/99. Clerkship Directors in Internal Medicine, Washington, DC.
6. Phelan S, Obenshain SS, Galey WR. Evaluation of the noncognitive professional traits of medical students. Acad Med. 1993;68:799–803.
7. Papadakis MA, Osborn EH, Cooke M, Healy K. A strategy for the detection and evaluation of unprofessional behavior in medical students. Acad Med. 1999;74:980–90.
8. Stagnaro-Green A, Packman C, Baker E, Elnicki DM. Ambulatory education: expanding undergraduate experience in medical education. Am J Med. 1995;99:111–5.
9. Woolliscroft JO, Schwenk TL. Teaching and learning in the ambulatory setting. Acad Med. 1989;64:644–8.
10. Hemmer PA, Pangaro LP. The effectiveness of formal evaluation sessions during clinical clerkships in better identifying students with marginal funds of knowledge. Acad Med. 1997;72:641–3.
11. Lavin B, Pangaro L. Internship ratings as a validity outcome measure for an evaluation system to identify inadequate clerkship performance. Acad Med. 1998;73:998–1002.
12. Pangaro L, Gibson K, Russel W, Lucas C, Marple R. A prospective, randomized trial of a six-week ambulatory medicine rotation. Acad Med. 1995;70:537–41.
13. Noel GL. A system for evaluating and counseling marginal students during clinical clerkships. J Med Educ. 1987;62:353–5.
14. Pangaro L. A new vocabulary and other innovations for improving descriptive in-training evaluations. Acad Med. 1999;74:1203–7.
15. Project Professionalism. Philadelphia, PA: The American Board of Internal Medicine, 1995.
16. Pangaro LN, Hemmer PA, Gibson KF, Holmboe E. Formal Evaluation Sessions Enhance the Evaluation of Professional Demeanor. Paper presented at the Eighth International Ottawa Conference on Medical Education and Assessment, Philadelphia, PA, July 1998.
17. Irby DM, Milam S. The legal context for evaluating and dismissing medical students and residents. Acad Med. 1989;64:639–43.
18. Tonesk X. Clinical judgement of faculty in the evaluation of clerks. J Med Educ. 1983; 58:213–4.
19. Hunt DD. Functional and dysfunctional characteristics of the prevailing model of clinical evaluation systems in North American medical schools. Acad Med. 1992;67:254–9.
© 2000 Association of American Medical Colleges