Kreiter, Clarence D. PhD; Ferguson, Kristi J. PhD
The preceptor's rating of a medical student's performance is the most widely used method for assigning grades in the clinical clerkship1 and plays important roles in composing the dean's letter and in residency selection.2 Because of the importance of preceptors' ratings in graduate and undergraduate medical education, these clerkship evaluations should be as valid and reliable as possible. During the 1999–00 academic year at the University of Iowa College of Medicine, over 600 raters and preceptors in eight clerkships filled out 6,291 standardized clinical evaluation forms (CEFs) used for rating the performances of third-year medical students. The standardized form consisted of 12 items to assess three broad skill areas, each with an identical five-point rating scale (1 = unacceptable to 5 = outstanding). Preceptors filled out an average of 31.1 CEFs per student during the clinical clerkship year, and 513 (8.2%) CEFs were submitted with 5, the highest possible rating, selected for all 12 items on the form. The validity of CEFs coded with the same positive extreme value for all items, or a straight-line five (SL5) response, is unknown. Social science researchers have questioned the validity of questionnaires or rating forms displaying a uniform extreme response format. Oskamp3 labeled forms that “answer all items in the same way” as an example of carelessness and suggested such forms usually be discarded. This study investigated whether the SL5 response pattern observed on the CEF might be attributable to rater carelessness.
Response sets and response biases, both a source of systematic error in the measurement process, have been the subject of considerable research. A response set is defined as (1) a reliable source of variance in individual differences that (2) is an artifactual product of the measurement method and (3) is at least partially independent of the trait the measurement method is intended to measure.4 Examples of a response set as displayed on surveys or rating forms include phenomena such as marking most items at the extremes, responding in the most socially acceptable fashion, or agreeing with items regardless of content. Research on the response set phenomenon has focused on documenting socially desirable responses,5 acquiescence,6,7 and extreme responding8,9 as sources of measurement error emanating from respondents' personal characteristics rather than the content of the measurement instrument.
While also contributing to measurement error, response bias differs from a response set in that it does not generate reliable individual differences, but rather has a systematic and reliable impact on group means. When it affects groups differently, response bias significantly threatens the validity of psychometric measurements and their interpretation within research studies comparing groups' performances. There are as many examples of this phenomenon as there are research designs, and any study comparing two or more groups is potentially vulnerable to its influence. Response biases can arise from sources such as variation in the written or oral instructions provided to raters, social-expectation pressures on respondents, unintentional demand characteristics related to an experimental treatment, and other aspects of the measurement environment.
Carelessness may or may not be regarded as a response phenomenon distinct from a response set or response bias and, depending on the circumstances, can contribute to either systematic or random error in the measurement process. For example, a respondent or rater who randomly marks items, or marks the same option for all items on a counterbalanced measurement instrument, would likely contribute random error and not produce a reliable effect on either individual or group measures. However, a careless responder could act in a nonrandom fashion, producing a reliable effect like that of a response set or response bias. This can occur when a rater or respondent marks one extreme end of the score scale for each item on a form not employing a counterbalanced design. Such a response pattern tends to increase the within-form reliability or internal consistency and produce a reliable rater or individual effect like that of a response set. In fact, some researchers have characterized carelessness as a response set,10 and Oskamp3 states that carelessness should be regarded as a response set because it displays “systematic ways of answering which are not directly related to question content, but which represent typical behavioral characteristics of the respondent.” Further, when a systematic difference in the levels of carelessness exists between two groups of responders, carelessness may produce a different and systematic impact on group means similar to that of a response bias. This can happen in situations in which careless responses tend to be uniform. For instance, if raters are unwilling or unable to perform adequate observations, they might award high scores to avoid penalizing the examinee or lessen the likelihood of having to justify the score to the student. When nested within groups, the variation in raters' conscientiousness may contaminate group comparisons. 
Despite the difficulty in classifying carelessness—to the extent it functions as a response set or response bias, or contributes random error—individual response measures or group means will not reflect the intent of the measurement instrument. This threatens validity and warrants an investigation to detect and minimize its impact.
In research on an earlier 19-item version of the CEF, a generalizability analysis of ratings produced a very small effect for the item facet and a high level of internal consistency within the form.11 A principal-component analysis of the items showed they were essentially unidimensional (one eigenvalue greater than one), a finding consistent with the decision (D) study phase of the research, which demonstrated that substantially reducing the number of items had only a minor impact on reliability. Either preceptors failed to discriminate among the 19 items on this form and allowed a “halo effect” to influence the ratings, or the assessed dimensions of clinical performance were very highly correlated. Since the form's items were designed to measure three somewhat independent constructs of clinical competence—interpersonal skills, clinical skills, and professional attributes—it seems unlikely that a high correlation among the constructs might explain these results. Despite attempts to measure different aspects of performance with item sub-groups on the CEF, obtained ratings on individual items appear to reflect the preceptor's global impression of the student. This does not, however, imply such ratings lack validity. When presented with the unstandardized conditions that characterize performance ratings for clinical clerkships (varying levels of observation opportunity and variability in patient populations and situational factors), raters may conclude global impressions are more defensible than are the behaviorally specific ratings requested by the items. Indeed, the validity of these ratings was supported in a recent study12 showing mean CEF scores moderately correlated with performances on clinical knowledge tests and postgraduate clinical competence ratings.
The SL5 response format observed on the CEF may not be valid if the responses arise from raters' carelessness, a lack of commitment to the rating process, or a misunderstanding of the coding procedures. Alternately, the SL5 responses could represent valid observations. First, as discussed, the ratings on all the CEF items might represent a global impression of performance rather than an assessment of specific behaviors. Hence, an SL5 response may represent a very positive overall evaluation of the student. Another explanation arises from the distribution of item ratings for the 1999–00 academic year, which suggests a strong “ceiling effect” in the item rating scale (see Figure 1). Although the ceiling effect cannot statistically explain the observed high proportion of SL5 forms, it certainly does increase the expected proportion of valid SL5 responses when evaluating highly talented students. Deleting SL5 forms that result from either of these two reasons would not facilitate measurement precision. To the contrary, SL5 responses attributed to these phenomena would reflect a highly positive evaluation of the students receiving them and would provide useful assessment information.
We used a correlation analysis and a generalizability study to investigate the empirical validity of SL5 responses to the CEF. The correlation analysis provided information regarding the tendency for high-scoring students to receive SL5 ratings. The generalizability study addressed the impact of SL5 forms on the reliability of mean ratings of students across forms. If the direction, magnitude, and statistical test of the correlation coefficient suggest higher-scoring students receive more SL5 responses, this will support the conclusion that at least some CEFs with SL5s are valid. However, such a finding will not reveal whether a subset of SL5 responses might also reflect raters' carelessness and perhaps have a negative impact on score reliability. An analysis of score variance computed using the generalizability method and a standard statistical software package provides information about how the SL5 response format affects the overall reliability of mean CEF scores. For all analyses, the data consisted of evaluations of third-year students who had received 18 or more ratings during the clerkship year. This data set consisted of 168 students and 5,682 observations, 460 (8.2%) with SL5s.
Multiple steps were required in the correlation analysis to determine the relationship between students' mean scores and the proportion of SL5s. First, we computed the proportion of SL5 forms per student and converted it to a variable. Next, we removed the 460 SL5 forms from the data set and calculated a mean student CEF score across forms for each student. We correlated this mean student CEF score with the proportion of SL5 forms each student received. Then we conducted a statistical test of this correlation to generate the significance probability under the null hypothesis that the correlation was equal to zero.13
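The steps above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation; the column names and the use of pandas and SciPy are assumptions.

```python
import pandas as pd
from scipy import stats

def sl5_correlation(forms: pd.DataFrame):
    """Correlate each student's proportion of SL5 forms with that
    student's mean CEF score computed after removing the SL5 forms.

    `forms` holds one row per CEF, with columns `student` and `score`
    (the form's mean item rating). On a 1-5 scale, a mean of exactly 5
    implies every item was rated 5, i.e., an SL5 form.
    """
    is_sl5 = forms["score"] == 5.0

    # Proportion of SL5 forms per student.
    prop_sl5 = is_sl5.groupby(forms["student"]).mean()

    # Mean across-form score per student, SL5 forms removed.
    mean_non_sl5 = forms.loc[~is_sl5].groupby("student")["score"].mean()

    merged = pd.concat(
        [prop_sl5.rename("p_sl5"), mean_non_sl5.rename("mean_score")],
        axis=1,
    ).dropna()

    # Pearson correlation and its two-sided significance probability
    # under the null hypothesis that the correlation equals zero.
    r, p = stats.pearsonr(merged["p_sl5"], merged["mean_score"])
    return r, p
```

A positive, significant `r` would indicate, as in the analysis above, that higher-scoring students tend to receive more SL5 forms.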
To gauge the overall impact of SL5s on the score reliability of the CEFs, we conducted generalizability studies to estimate variance components with and without SL5 forms. We used a raters-nested-within-person (r:p) model on three balanced stratified random samples of ten forms for each of the 168 students. We randomly selected forms using a programmed random selection method and a standard statistical software package. The first sample (Sample A) contained no SL5 forms; Samples B and C included 8% and 13% randomly selected SL5 forms, respectively. Hence, for Samples A, B, and C, the random sampling of 1,680 total forms was constrained to include 0, 134, and 230 randomly selected SL5 forms, respectively. The 8% of SL5s in Sample B equaled the proportion of the SL5s in the general population of CEF ratings. The 13% in Sample C (half the available SL5 forms) was included to study the impact of increasing the number of SL5 forms above that of the population. We computed the magnitude of components contributing to error variance and person variance for each sample and generated reliability or G coefficients.
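For a balanced r:p design such as these ten-forms-per-student samples, the variance components and G coefficients can be estimated from the usual one-way random-effects mean squares. A minimal sketch under that assumption (not the authors' software, which is unnamed here):

```python
import numpy as np

def g_study_nested(scores: np.ndarray):
    """Estimate variance components for a balanced raters-nested-
    within-persons (r:p) design from a persons x raters score matrix.

    Returns (var_p, var_rp): the person variance and the confounded
    rater-within-person (error) variance.
    """
    n_p, n_r = scores.shape
    person_means = scores.mean(axis=1)
    grand_mean = scores.mean()

    # One-way random-effects mean squares.
    ms_between = n_r * ((person_means - grand_mean) ** 2).sum() / (n_p - 1)
    ms_within = ((scores - person_means[:, None]) ** 2).sum() / (n_p * (n_r - 1))

    var_rp = ms_within                                 # sigma^2(r:p)
    var_p = max((ms_between - ms_within) / n_r, 0.0)   # sigma^2(p)
    return var_p, var_rp

def g_coefficient(var_p: float, var_rp: float, n: int) -> float:
    """G coefficient for a mean score over n raters:
    sigma^2(p) / (sigma^2(p) + sigma^2(r:p) / n)."""
    return var_p / (var_p + var_rp / n)
```

Comparing `g_coefficient` values computed from samples with and without SL5 forms parallels the comparison across Samples A, B, and C.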
The correlation coefficient between the proportion of SL5 forms per student and the mean across-form total score calculated without the SL5 forms was r = .48. This was significantly different from zero at p < .0001 (n = 168). Figure 2 displays the scatterplot and regression line of this relationship.
Table 1 shows the results of the three generalizability studies. The obtained variance components, standard errors, and G coefficients for 1, 10, and 30 observations are shown for Sample A. The G coefficient for a single observation is .11; the G coefficient for ten observations, the observed reliability of our sample, is .56. Since, on average, each student was observed just over 30 times during the clerkship year, we also provide an estimate of the reliability of the mean student CEF across 30 observations. Sample B reflects the reliabilities of mean student CEFs when the proportion of SL5 CEFs equals that of the general population. Sample C, which retains 230 SL5 forms, demonstrates the impact on reliability when the proportion of SL5s is larger than that of the general population. The G coefficients in Sample A are slightly lower than those in Samples B and C (see Table 1). Removing the SL5 forms did not elevate the CEF's reliability. Even when the proportion of SL5 forms was elevated beyond that of the population, the G coefficients remained higher than those of the sample without SL5 forms.
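As a rough consistency check on the Sample A coefficients, projecting the single-observation G coefficient to n observations with the Spearman-Brown formula (which, for a single nested error facet, is equivalent to the G-coefficient formula with the error variance divided by n) nearly reproduces the reported ten-observation value; the small discrepancy reflects rounding of the .11 input.

```python
def projected_g(g1: float, n: int) -> float:
    """Spearman-Brown projection: reliability of the mean of n
    ratings, given the single-rating G coefficient g1."""
    return n * g1 / (1 + (n - 1) * g1)

# Starting from the rounded single-observation coefficient of .11:
print(projected_g(0.11, 10))  # approx. 0.55 (Table 1 reports .56)
print(projected_g(0.11, 30))  # approx. 0.79
```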
The magnitude and significance of the correlation coefficient clearly demonstrated that higher-scoring students were more likely to receive the SL5 rating. Although we cannot provide a definitive magnitude-based interpretation of this coefficient, it appears a large percentage of the SL5 ratings reflects students with high performances rather than raters' carelessness. Although beyond the scope of this study, a more precise interpretation of this coefficient's magnitude could be evaluated in terms of the likely attenuation attributable to the level of rating reliability and the reproducibility of the SL5 occurrence measure. However, if a significant number of SL5 forms also result from carelessness, they should tend to add error to an across-form mean student score. We found no evidence of this, as including the SL5 forms did not negatively affect the person variance and the reliability did not appear to suffer. It appears that the SL5 rating configuration does indeed reflect raters' perceptions of a high level of student performance. Given that items appear to reflect a more global assessment of student performance in the clerkship, it is possible raters use the items as an extension of the rating scale, and the SL5 format to convey their highest possible evaluation.
Additional evidence supporting the SL5's validity arose from an informal qualitative study conducted by this study's authors on comments submitted with this sample of CEFs. A single rater assessed a sampling of preceptors' comments on SL5 forms, compared with those not displaying the SL5 format (non-SL5 forms). Comments on 50 randomly selected SL5 forms and 50 randomly selected non-SL5 forms were presented randomly to a blinded rater, who categorized each comment in one of two categories. One category represented comments the rater judged representative of the highest-rated students, and the other, comments expected for less-than-perfectly-rated students. In total, the rater judged 28 of the 100 comments to belong to the highest-rated students. When the researchers cross-checked these classifications, they found all 28 of the highest-rated comments were from SL5 forms. Additional research is needed to characterize and compare qualitative aspects of comments on SL5 forms with those from non-SL5 forms. However, this small qualitative study provides additional evidence supporting the validity of the SL5 format.
Our findings strongly support the validity of the SL5 format. In the past it was difficult to interpret assessments containing uniformly maximum positive ratings. Those responsible for summarizing these evaluations were uncertain about how to use such information. A tendency existed to suspect such ratings arose from the preceptor's inability to conduct a thorough student observation; hence evaluators may have devalued such responses. As SL5 ratings appear to result from high-level performances, students deserve credit for these evaluations, which should be documented and used for grading and assigning special honors.
Limitations exist regarding the degree to which one can generalize these study results to other clinical evaluation environments. First, clerkship evaluation forms vary in rating scales and question formats. Although this study's results offer insight primarily into the rating process preceptors use, which might remain consistent across different forms, a scale change could also alter the nature of the uniform maximum positive response format. In addition, it would be informative, but possibly not ethical since they are used for grading, to address this question in a more experimental fashion. Such a design might use form and item reversals to study how response formats arise as artifacts of the rating scale rather than the rating process.
1. Magarian GL, Mazur DJ. Evaluation of students in medicine clerkships. Acad Med. 1990;65:341–5.
2. Villanueva AM, Kaye D, Abdelhak SS, Morahan PS. Comparing selection criteria of residency directors and physician employers. Acad Med. 1995;70:261–71.
3. Oskamp S. Attitudes and Opinions. 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1991.
4. Nunnally JC. Psychometric Theory. New York: McGraw-Hill, 1967.
5. Bearden WO, Rose RL. Attention to social comparison information: an individual difference factor affecting consumer conformity. J Consumer Res. 1990;16:461–71.
6. Hui CH, Triandis HC. The instability of response sets. Public Opin Q. 1985;49:253–60.
7. McClendon MJ. Acquiescence and recency response-order effects in interview surveys. Sociological Methods and Research. 1991;20:60–103.
8. Greenleaf EA. Measuring extreme response style. Public Opin Q. 1992;56:328–51.
9. Bachman JG, O'Malley PM. Yea-saying, nay-saying and going to extremes: black and white differences in response styles. Public Opin Q. 1984;48:491–509.
10. Topf M. Response sets in questionnaire research. Nurs Res. 1986;35:119–21.
11. Kreiter CD, Ferguson KJ, Lee W, Brennan RL, Densen P. A generalizability study of a new standardized rating form used to evaluate students' clinical clerkship performances. Acad Med. 1998;73:1294–8.
12. Callahan CA, Erdmann JB, Hojat M, et al. Validity of faculty ratings of students' clinical competence in core clerkships in relation to scores on licensing examinations and supervisors' ratings in residency. Acad Med. 2000;75(10 suppl):S71–S73.
13. Hollander M, Wolfe D. Nonparametric Statistical Methods. New York: John Wiley & Sons, 1973.