Reliable and valid assessment of medical students’ and residents’ clinical competence is an ongoing challenge for medical educators.1–3 Despite recognition of the importance of direct observation,4–7 clinical teaching faculty rarely use it as a basis for clinical assessment.8,9 Instead, there typically is heavy reliance on indirect sources such as case discussion and note review. Data indicate that students are directly observed by a faculty member performing a history and physical examination four or fewer times through the whole course of medical school training.10
The literature documents numerous benefits of direct faculty observation of trainee physicians’ encounters with patients. Direct observation provides data for ongoing formative evaluation and feedback, which are critical to reinforce appropriate learning of clinical skills and to correct deficiencies.11 Medical students’ improvement in problem formulation and mental status examination skills has been shown to be directly attributable to direct observation, and students also perceive that observation as important.12 Direct observation also has been shown to help supervisors be more specific and detailed when completing summative evaluations.12 However, studies examining the relationship between direct observation and reliable and valid assessment of trainee physicians’ clinical competence have been limited in number and tentative in results.6
The purpose of this research was to study the influence of direct observation on reliability and validity evidence for clinical performance ratings of family medicine clerks.
Setting, Subjects, and Instrument
During the 2002–03 academic year, all third-year medical students (n = 172) at the University of Illinois at Chicago College of Medicine attended a required six-week family medicine clerkship. At the end of the clerkship, preceptors completed a newly developed written assessment tool to evaluate students’ performance. The instrument consisted of 16 behavior-based items rated on a five-point Likert scale, ranging from 1 = almost never to 5 = almost always. Preceptors were required to indicate the basis for their ratings of each student on each item by checking any or all of three possible data-sources: note review, case discussion, and/or direct observation.
All preceptors who supervised a student for at least four half-days during the six-week experience completed the evaluation form at the end of the clerkship. For each student, a clerkship clinical score was computed by calculating the mean of all preceptors’ ratings, weighted by the number of half-days spent with each student. The final grade for the clerkship was based on the clinical score and students’ end-of-clerkship National Board of Medical Examiners (NBME®) subject examination score.
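The weighting described above can be illustrated with a minimal Python sketch. It assumes that each preceptor's form is first reduced to a single mean rating across the items, which the text does not state explicitly; the function name and the example values are hypothetical.

```python
def clerkship_clinical_score(ratings):
    """Weighted mean of preceptor ratings for one student.

    ratings: list of (mean_rating, half_days) pairs, one per preceptor.
    Assumption: each preceptor's form has already been reduced to one
    mean rating on the 1-5 scale.
    """
    # Only preceptors with at least four half-days completed the form.
    eligible = [(r, h) for r, h in ratings if h >= 4]
    total_half_days = sum(h for _, h in eligible)
    if total_half_days == 0:
        raise ValueError("no eligible preceptor ratings")
    # Each rating counts in proportion to the half-days spent with the student.
    return sum(r * h for r, h in eligible) / total_half_days


# Hypothetical student rated by three preceptors.
example = [(4.5, 12), (3.5, 6), (4.0, 6)]
score = clerkship_clinical_score(example)
print(round(score, 3))
```

In this hypothetical case the preceptor with 12 half-days contributes half of the total weight, so the score sits closer to that preceptor's rating than an unweighted mean would.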
At the beginning of their fourth year, students took an objective structured clinical examination (OSCE) consisting of eight standardized patient (SP) based stations. Each station consisted of a 15-minute encounter with the SP, followed by a ten-minute interval during which the SP and the student completed their respective post-encounter tasks. Finally, SPs gave each student five minutes of verbal feedback on the student's communication and interpersonal skills. Student post-encounter tasks were completed using components of a computer-based clinical competency examination program, which is designed to assess students’ clinical reasoning. Students received a composite clinical competence score that reflected all components of the exam.
Study Design and Research Questions
This research was conducted using a retrospective study design. Institutional Review Board approval for ethical conduct of human subjects research was obtained prior to initiating the study.
The following research questions guided the study:
- Do ratings that include more direct observation as a data-source for evaluating students’ clinical performance result in greater interrater reliability?
- Do ratings that include more direct observation as a data-source for evaluating students’ clinical performance influence the association between clerkship clinical ratings and other measures of students’ knowledge and clinical competence?
Data and Analysis
The following variables were used for the analyses:
- Individual scores for all 16 items for each student's first rater (for internal consistency analysis).
- Mean scores for all 16 items for each student's first three raters (for interrater reliability analyses).
- NBME Subject Examination scores (criterion variable for validity analyses).
- Fourth-year SP-based clinical competence examination scores (criterion variable for validity analyses).
- Clerkship clinical scores from all preceptors (predictor variable for validity analyses).
- Data-source score (stratifying variable for subgroup analyses). This variable was computed from the bases that the first rater (Rater A) indicated for rating each item (note review, case discussion, direct observation). Rater A was selected for two reasons: (1) approximately one-third (33%) of the students had only one rater; and (2) on average, Rater A had the most contact with the students, as evidenced by the average number of half-days spent with the student. Based on the perceived importance of the three data-sources, each was given an arbitrary weight: note review (1), case discussion (2), and direct observation (3). Preceptors could check more than one data-source; all unchecked data-sources were scored as zero. For example, for Item 1 on the rating form (student asks thoughtful, purposeful questions to obtain a relevant history), if a preceptor checked direct observation and note review as their bases for the rating, their data-source score for Item 1 would equal 4 (3 for direct observation plus 1 for note review). A data-source score for each item was computed by summing the points for all three data-source variables. Thus, each item could potentially get a data-source score ranging from 0 (if none of the three sources was checked) to 6 (if all three sources were checked). A mean data-source score was computed across all 16 items for each student's first rater. Three categories of data-source score were defined: low (0–2.94), medium (3–4.4), and high (4.5–6).
The following analyses were conducted:
- Data from the first rater for each student were used to compute Cronbach's alpha to examine the internal consistency (scale reliability) of the 16 items.
- Mean scores for the 16 items for the first three raters for each student were used to compute intraclass correlation coefficients to examine the level of interrater reliability among the three raters.
- Pearson correlation coefficients were computed between the clerkship clinical scores and the NBME subject examination scores.
- Pearson correlation coefficients were computed between the clerkship clinical scores and the fourth-year SP-based clinical competence exam scores.
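The data-source score defined among the variables above can be sketched in Python as follows. The weights and the worked Item 1 example come from the text; the function names and dictionary keys are hypothetical, and the handling of means that fall between the published category ranges (for example, between 2.94 and 3) is an assumption, since the text does not specify it.

```python
# Weights assigned to each data-source in the text.
WEIGHTS = {"note_review": 1, "case_discussion": 2, "direct_observation": 3}


def item_data_source_score(checked):
    """Sum the weights of the data-sources checked for one item (0 to 6)."""
    return sum(WEIGHTS[source] for source in set(checked))


def mean_data_source_score(items):
    """Mean item score across all items rated by the first rater."""
    return sum(item_data_source_score(c) for c in items) / len(items)


def data_source_category(mean_score):
    """Map a mean score to low / medium / high.

    Assumption: cut points at 3 and 4.5, so a mean that falls in a gap
    between the published ranges is assigned to the lower category.
    """
    if mean_score >= 4.5:
        return "high"
    if mean_score >= 3:
        return "medium"
    return "low"


# Item 1 example from the text: direct observation plus note review.
item_1 = item_data_source_score({"direct_observation", "note_review"})

# Hypothetical rater: 8 items based on all three sources, 8 on two sources.
all_three = {"note_review", "case_discussion", "direct_observation"}
two_sources = {"direct_observation", "note_review"}
rater_mean = mean_data_source_score([all_three] * 8 + [two_sources] * 8)
print(item_1, rater_mean, data_source_category(rater_mean))
```

Because the weights are additive, a given score does not always identify which sources were checked: a score of 3 can arise from direct observation alone or from note review plus case discussion.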
The analyses were first conducted using the complete data set and repeated after stratifying the data set by the three categories of data-source score (low, medium, and high). Disattenuated correlation coefficients were computed for overall as well as subgroup correlations. The disattenuated correlation estimates the “true score” correlation of the variables if the measures were free of error or perfectly reliable.13
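As a brief illustration of the correction for attenuation, the sketch below applies the standard formula, in which the observed correlation is divided by the square root of the product of the two measures' reliabilities. The text does not report which reliability estimates were entered into the correction, so the values in the example are purely hypothetical.

```python
import math


def disattenuated_r(r_xy, rel_x, rel_y):
    """Correction for attenuation: r_xy / sqrt(rel_x * rel_y).

    r_xy  : observed correlation between measures x and y
    rel_x : reliability estimate for x (greater than 0, at most 1)
    rel_y : reliability estimate for y (greater than 0, at most 1)
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values, not taken from the study.
corrected = disattenuated_r(0.30, 0.81, 0.64)
print(round(corrected, 3))
```

Because reliabilities cannot exceed 1, the corrected coefficient is never smaller in magnitude than the observed one, and with low reliability estimates it can exceed 1.0, which is one reason such estimates are usually interpreted with caution.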
Of the 172 clerks, 91 (52.9%) were male and 81 (47.1%) were female. Students attended one of 27 different clinical sites, with 126 (73.3%) students at residency sites and 46 (26.7%) at nonresidency sites. The number of raters per student ranged from 1 to 6; 56 (33%) of the students had only one rater and most students (77%) had no more than three raters. Most of the students at residency sites had multiple raters and those at nonresidency sites generally had only one rater. Preceptors spent varying amounts of time with the students. On average, Rater A spent the greatest number of half-days with the students: 11.53 half-days, compared with an average of 5.67 and 5.76 half-days for Raters B and C, respectively.
Cronbach's alpha for internal consistency of the 16 items was .93 (n = 121). The intraclass correlation coefficient for agreement among three raters was .36 (n = 82). The correlation between clerkship clinical scores and NBME subject exam scores was positive, but not significant (r = .141, p = .070). The correlation between clerkship clinical scores and fourth-year SP-based clinical competence exam scores was also positive and nearly reached significance (r = .159, p = .058).
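For readers less familiar with the internal-consistency statistic reported above, the following is a minimal sketch of the standard Cronbach's alpha formula, using sample variances from the Python standard library. The ratings are hypothetical (four students, three items) and are not study data; the study's instrument had 16 items.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows (one row of item ratings per student).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(rows[0])
    # Sample variance of each item across students.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each student's summed score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings on a 1-5 scale.
ratings = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
]
alpha = cronbach_alpha(ratings)
print(round(alpha, 3))
```

A high alpha indicates that the items vary together across students; it says nothing about whether different raters agree, which is why the interrater coefficient is reported separately.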
The results of the interrater reliability analyses stratified by three categories of data-source score are presented in Table 1. Because of missing data for the data-source score variable, only 59 ratings were available for the subgroup analyses. Interrater reliability increased as a function of data-source score. Interrater reliability for the group with low data-source scores (n = 24) was .29, for the group with medium data-source scores (n = 17) was .50, and for the group with high data-source scores (n = 18) was .74. Disattenuated correlation coefficients for all analyses were higher than unadjusted correlation coefficients. These coefficients simply indicate what the true correlation would be in the population if the measurements were perfectly reliable, thus helping to differentiate between low correlations due to measurement error and those due to no true correlation at all.
Table 2 presents the results of the validity analyses. For the groups with low and medium data-source scores, clerkship clinical scores were not significantly correlated with either NBME subject examination scores or fourth-year SP-based clinical competence examination scores. For the group with high data-source scores, the correlation between clerkship clinical scores and NBME subject examination scores was positive and approached significance (r = .311, p = .054), and the correlation between clerkship clinical scores and the fourth-year SP-based clinical competence examination scores was positive and significant (r = .423, p = .009).
Evaluation of the complex construct of clinical competence requires assessment methods with sufficient reliability and validity evidence to allow meaningful interpretations of data obtained. The clinical rating forms that are commonly utilized by teaching faculty to judge the clinical competence of medical students and residents are fraught with a number of problems,2,3,14–16 not the least of which is that data for such ratings are seldom based on direct observation.8,9
This study demonstrated that the overall quality and meaningfulness of ratings of clinical performance are improved when preceptors include more direct observation as a data-source. We found poor agreement among raters when data-source was not taken into account. The high level of rater disagreement may have been due, as others have suggested,13,17,18 to the fact that standards for judging clinical competence were not explicit. The fact that rater agreement increased as a function of data-source score may suggest that the requirement that preceptors indicate the bases for their ratings helped make preceptors more aware and thoughtful about the standards for their ratings. This idea warrants further study.
Since the ultimate purpose of clinical assessments is to predict future performance, evaluation processes must demonstrate their efficacy for making inferences about future performance. In this study, we found that when data-source was not taken into account, the clerkship clinical scores were a poor predictor of students’ performance on both the end-of-clerkship NBME subject examination and the fourth-year SP-based clinical competence examination. On the other hand, when data-source was taken into account, for the high data-source group, students’ clerkship clinical scores were positively associated with both the NBME subject examination scores and the fourth-year SP-based clinical competence examination scores. This finding strongly suggests, then, that when a rating form includes bases for preceptors’ ratings, the ratings of those preceptors who use multiple sources, including more direct observation, are more reliable and valid. It may be that encouraging preceptors to employ multiple bases, including direct observation, for making judgments about trainees is a more feasible approach to addressing interrater reliability issues than requiring a minimum of seven raters per trainee, as has been suggested in the literature.19
Finding the time to directly observe trainee physicians remains a challenge, given competing service and fiscal demands on clinical faculty's time.20,21 This study highlights the importance of developing practical strategies for making direct observation “doable.” Research, for example, suggests that it is possible to judge whether a student or resident is conducting a purposeful and hypothesis-driven inquiry by evaluating only the first three to five minutes of a history-taking encounter.22,23 Further, raters can be trained relatively easily to recognize this competence. Perhaps requiring that preceptors directly observe only selected skills would address the issue of feasibility, which is the most frequently cited limitation of requiring direct observation. It may even be that this approach would prove more time efficient than listening time and again to inefficient case presentations in order to make indirect, and inevitably less reliable, judgments about skills that can be adequately evaluated only by direct observation.
This study has certain limitations. First, the three data-source groups were defined using only the first rater's scores, though this choice was reasonable as about one-third of the students only had one rater, and on average the first rater also had the most contact with the students. Second, the number of subjects for the subgroup analyses was smaller than that of the overall analyses, thus limiting the generalizability of the findings. Replication of this study with larger samples is recommended. Additional studies also are needed to address issues of feasibility regarding the use of direct observation, including training methods to improve the direct observation skills of faculty. From a practical standpoint, since it is not possible to ask all raters to directly observe all students, the ratings of those preceptors who include direct observation might be given more weight than those who rely only on more typical means, such as note review and case discussion.
By demonstrating reliability and validity evidence for clinical ratings that include more direct observation, this study provides a strong argument for developing strategies to ensure that direct observation of clinical behaviors that cannot easily be evaluated otherwise, such as history-taking and physical examination skills, becomes a routine part of formative and summative clinical evaluation systems.
1. Landon BE, Normand SL, Blumenthal D, Daley J. Physician clinical performance assessment: prospects and barriers. JAMA. 2003;290:1183–9.
2. Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15:270–92.
3. Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ. 2003;37:830–7.
4. Feinstein AR. Clinical Judgment. Baltimore: Williams & Wilkins, 1967:1–71, 291–349.
5. Engel GL. The deficiencies of the case presentation as a method of clinical teaching: another approach. N Engl J Med. 1971;284:20–4.
6. Wakefield J. Direct observation. In: Neufeld VR, Norman GR (eds). Assessing Clinical Competence. New York: Springer Publishing Company, 1985:51–70.
7. Holmboe ES. Faculty and the observation of trainees’ clinical skills: problems and opportunities. Acad Med. 2004;79:16–22.
8. Stillman P, Swanson D, Regan MB, et al. Assessment of clinical skills of residents utilizing standardized patients: a follow-up study and recommendations for application. Ann Intern Med. 1991;114:393–401.
9. Kassebaum DG, Eaglen RH. Shortcomings in the evaluation of students’ clinical skills and behaviors in medical schools. Acad Med. 1999;74:842–9.
10. Steege/Thompson Communications. Clinical skills exam: frequently asked questions 〈www.usmle.org〉. Accessed 5 February 2003.
11. Holmboe ES, Williams FK, Yepes M, Norcini JJ, Blank LL, Huot S. The mini clinical evaluation exercise and interactive feedback: preliminary results. J Gen Intern Med. 2001;16(suppl 1):100.
12. Links PS, Colton T, Norman GR. Evaluating a direct observation exercise in a psychiatric clerkship. Med Educ. 1984;18:46–51.
13. Muchinsky PM. The correction for attenuation. Educ Psychol Meas. 1996;56:63–75.
14. Herbers JE Jr, Noel GL, Cooper GS, Harvey J, Pangaro LN, Weaver MJ. How accurate are faculty evaluations of clinical competence? J Gen Intern Med. 1989;4:202–8.
15. Metheny WP. Limitations of physician ratings in the assessment of student clinical performance in an obstetrics and gynecology clerkship. Obstet Gynecol. 1991;78:136–41.
16. Kreiter CD, Ferguson K, Lee WC, Brennan RL, Densen P. A generalizability study of a new standardized rating form used to evaluate students’ clinical clerkship performances. Acad Med. 1998;73:1294–8.
17. Noel GL, Herbers JE Jr, Caplow MP, Cooper GS, Pangaro LN, Harvey J. How well do internal medicine faculty members evaluate the clinical skills of residents? Ann Intern Med. 1992;117:757–65.
18. Hasnain M, Onishi H, Elstein AS. Inter-rater reliability in judging errors in diagnostic reasoning. Med Educ. 2004;38:609–16.
19. Carline JD, Paauw DS, Thiede KW, Ramsey PG. Factors affecting the reliability of ratings of students’ clinical skills in a medicine clerkship. J Gen Intern Med. 1992;7:506–10.
20. Levinson W, Branch WT Jr, Kroenke K. Clinician Educators in academic medical centers: a two-part challenge. Ann Intern Med. 1998;129:59–64.
21. Levinson W, Rubenstein A. Integrating clinician-educators into academic medical centers: challenges and potential solutions. Acad Med. 2000;75:906–12.
22. Chang RW, Bordage G, Connell KJ. The importance of early problem representation during case presentations. Acad Med. 1998;73(suppl 10):S109–11.
23. Hasnain M, Bordage G, Connell KJ, Sinacore JM. History-taking behaviors associated with diagnostic competence of clerks: an exploratory study. Acad Med. 2001;76(suppl 10):S14–7.