Developing an attitude survey is no simple task. Developers typically rely on both tested procedures and “common wisdom” found in measurement or research texts and passed on from mentors to students. We distinguish “tested procedures” from “common wisdom,” defining the latter as widely held beliefs used to guide practice, beliefs often traceable to authoritative sources but not subjected to critical evaluation. One such belief is that negatively worded items should be incorporated in attitude measures to counter response sets.1 Responses to negatively worded items are then reverse-scored and combined with responses to positively worded items. This common wisdom has not been questioned in the medical education literature, although studies from other domains raise significant concerns. In this article, we review studies examining the inclusion of negatively phrased items in surveys and then present a case study of the consequences of negatively phrased items in a measure used in medical education research.
Incorporating both positively and negatively phrased items in attitude surveys has been common practice for at least three decades and is recommended by widely cited texts.1,2 Some texts3,4 present the practice as a straightforward recommendation; others, notably those by Nunnally1 and Anastasi,2 cite the response set literature, particularly social desirability and acquiescence/agreement, as a theoretical base, recommending the use of negatively phrased items to counter these undesirable influences. When consulting with questionnaire-design groups, we have encountered expectations that negatively phrased items be included. The practice is also found in recent publications; notably, Hojat et al.5 cited Anastasi2 for their use of half positively and half negatively phrased items in the Jefferson Scale of Physician Empathy (HP Version).
Though the use of negatively phrased items may be commonplace in medical education attitude measures, research in other domains has questioned the practice. Findings challenged the assumption that positively phrased attitude items could simply be reexpressed in negative terms and reverse-scored without changing meanings. Taylor and Bowers6 reported that subjects responded differently to negatively phrased items and that, when reverse-scored, negative items’ means differed from positive items’ means. Schriesheim and Hill's7 experimental study of responses to positively and negatively phrased items on a leadership measure found that using negatively phrased items reduced score reliability; respondents answered negatively phrased items differently, also affecting score validity. Pilotte and Gable8 found decreased reliability in a version of a computer anxiety scale incorporating negatively phrased items, noting that negative items’ response means differed systematically from similar positive items’ response means. Additional cautions about effects on reliability were raised in studies by Barnette9 and Schriesheim et al.10 Other factor-analytic studies11,12 demonstrated a “method effect” associated with negatively phrased attitude survey items. These studies questioned the wisdom of routinely incorporating negatively phrased items into surveys.
Early analyses of the Medical School Learning Environment Survey (MSLES), a key measure in longitudinal studies of University of Texas Medical Branch (UTMB) medical students, led us to suspect that the concerns raised in the studies described above might apply to it. We therefore framed three research questions about scores from this survey, which incorporates both positively and negatively worded items: (1) Does the distribution of mean item-level scores of negatively phrased items differ from the distribution of mean item-level scores of positively phrased items? (2) Within each of the seven MSLES scales, how do means and standard deviations of positive- and negative-item sets differ? (3) How are internal-consistency score reliability estimates affected by the presence of mixed positive and negative items on each scale?
The MSLES13,14 is a measure of students’ perceptions of the learning environment in medical schools. Adapted from Marshall's13 work in the late 1970s, the MSLES is a 55-item measure with seven scales, each with both positively and negatively phrased items: nurturance (NUR), organization (ORG), flexibility (FLEX), student-to-student interaction (SSINT), breadth of interest (BREADTH), emotional climate (EMOTION), and meaningful learning experience (MEANLE). Items are scored on a five-point scale (from “never” to “very often”). After reverse-scoring negative items, scale scores are formed by averaging responses across the scale's designated items. Marshall intended the negatively phrased items to counter undesirable response sets.13(p100) This use reflected the common assumption that a scale's positive and negative items measured the same unidimensional construct on the same response continuum. Examples from the FLEX scale are “Faculty try out new teaching methods and materials” (+) and “Curricular and administrative policies are inflexible” (−).
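The scoring rule just described can be illustrated with a minimal sketch. It assumes responses are coded 1 (“never”) through 5 (“very often”); the item identifiers and response values are hypothetical and are not MSLES data.

```python
from statistics import mean

# Assumed numeric coding of the five-point scale: 1 = "never", 5 = "very often".
SCALE_MIN, SCALE_MAX = 1, 5


def reverse_score(response):
    """Flip a response on the five-point scale (1<->5, 2<->4, 3 stays 3)."""
    return SCALE_MIN + SCALE_MAX - response


def scale_score(responses, negative_items):
    """Average one respondent's responses after reverse-scoring negative items.

    responses: dict mapping item id -> raw response (1..5)
    negative_items: set of item ids that are negatively phrased
    """
    adjusted = [
        reverse_score(value) if item in negative_items else value
        for item, value in responses.items()
    ]
    return mean(adjusted)


# Hypothetical respondent on a hypothetical four-item scale.
raw = {"flex_1": 4, "flex_2": 2, "flex_3": 5, "flex_4": 1}
negative = {"flex_2", "flex_4"}
flex_score = scale_score(raw, negative)
```

Reverse-scoring treats a low response to a negative item as equivalent to a high response to a positive item, which is exactly the assumption examined in this study.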
The MSLES is administered to all UTMB medical students during orientation week and three times thereafter, through the fourth year. We analyzed MSLES data from six classes (of approximately 200 students each), beginning with students entering in 1995 and ending with those entering in 2000. Data from administration during orientation (Time 1) and from the end of the first year (Time 2) for these classes were analyzed separately to incorporate replication into the study. Cases with one or more MSLES items missing were dropped from analysis.
In addition to mean item responses, we calculated scale means and standard deviations for Time 1 and Time 2 administrations across the six classes. For each MSLES scale, we constructed a positive-item set and a negative-item set from the appropriate items (e.g., “FLEX/positive” and “FLEX/negative”) and compared their respective score means and standard deviations.
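A minimal sketch of the item-set comparison follows. It assumes negative items have already been reverse-scored and that each item-set score is the respondent's mean across the set's items; the item names and responses are hypothetical.

```python
from statistics import mean, stdev


def item_set_scores(respondents, items):
    """Return each respondent's mean response across the given items.

    respondents: list of dicts mapping item id -> response
                 (negative items assumed to be reverse-scored already)
    items: item ids belonging to one item set
    """
    return [mean(person[item] for item in items) for person in respondents]


# Hypothetical data: two positive and two negative items, three respondents.
respondents = [
    {"p1": 4, "p2": 5, "n1": 3, "n2": 4},
    {"p1": 3, "p2": 3, "n1": 2, "n2": 2},
    {"p1": 5, "p2": 4, "n1": 4, "n2": 3},
]

positive_scores = item_set_scores(respondents, ["p1", "p2"])
negative_scores = item_set_scores(respondents, ["n1", "n2"])

# Mean and sample standard deviation for each item set.
summary = {
    "positive": (mean(positive_scores), stdev(positive_scores)),
    "negative": (mean(negative_scores), stdev(negative_scores)),
}
```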
We computed Cronbach's alpha for each scale and for their positive- and negative-item sets for Time 1 and Time 2 data. Alpha estimates scale items’ internal consistency based on shared variance among item scores and the assumption that scale items measure the same construct. We compared alpha estimates to examine the effects of scale composition. Since the alpha coefficient is sensitive to number of items, the Spearman-Brown prediction formula15 was applied to each item set's alpha value, thereby estimating the item set's reliability as though its number of items matched that of the original MSLES scale. This procedure yielded “predicted alphas” for comparison of positive- and negative-item set reliabilities to the reliability of the original scale with mixed positive and negative items.
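The two calculations in this step can be sketched as follows, using the standard textbook forms of Cronbach's alpha and the Spearman-Brown prediction formula. The function names, the example data, and the item counts (a 4-item set projected to an 8-item scale) are illustrative assumptions, not values from the study.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for one scale or item set.

    item_scores: list of per-item lists, each holding one score per
                 respondent (same respondent order in every list).
    """
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def spearman_brown(reliability, current_len, target_len):
    """Predict reliability if the item set were lengthened to target_len items."""
    n = target_len / current_len  # lengthening factor
    return n * reliability / (1 + (n - 1) * reliability)


# Hypothetical three-item set answered by five respondents.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 4, 5],
    [1, 3, 3, 5, 5],
]
alpha = cronbach_alpha(items)

# Hypothetical: a 4-item set with alpha .60, projected to an 8-item scale.
predicted_alpha = spearman_brown(0.60, current_len=4, target_len=8)
```

The adjustment matters because a positive- or negative-item set contains only a subset of the scale's items, so its unadjusted alpha would understate the reliability of a full-length scale built from items of that type.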
Finally, we calculated a stratified alpha coefficient for each MSLES scale, using the formula for the reliability of a linear composite described by Feldt and Brennan.16 Each MSLES scale score was treated as a linear composite of scores from the positive- and the negative-item sets. This assumes that the scale's item sets are related but different measures. We then compared the stratified alpha value to the original scale's alpha to observe the effect of that different assumption.
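A minimal sketch of this calculation is given below. It assumes the commonly stated form of stratified alpha, 1 − Σ σ_j²(1 − α_j) / σ_c², where σ_j² is the variance of item set j's summed score, α_j is that set's Cronbach's alpha, and σ_c² is the variance of the composite. Scores are summed rather than averaged here, which is an assumption made for simplicity; all data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores is a list of per-item score lists."""
    k = len(item_scores)
    totals = [sum(person) for person in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def stratified_alpha(strata):
    """Reliability of a composite formed from several item sets (strata).

    strata: list of item sets; each item set is a list of per-item score
            lists, with respondents in the same order throughout.
    """
    # Summed score of each respondent within each stratum.
    stratum_totals = [
        [sum(person) for person in zip(*items)] for items in strata
    ]
    # Composite score: sum of the stratum totals for each respondent.
    composite = [sum(parts) for parts in zip(*stratum_totals)]
    # Error variance contributed by each stratum.
    error = sum(
        variance(totals) * (1 - cronbach_alpha(items))
        for items, totals in zip(strata, stratum_totals)
    )
    return 1 - error / variance(composite)


# Hypothetical scale: two positive items and two reverse-scored negative items.
positive_set = [[4, 5, 3, 4, 2], [5, 5, 3, 4, 3]]
negative_set = [[3, 4, 2, 4, 2], [3, 5, 2, 3, 1]]
strat_alpha = stratified_alpha([positive_set, negative_set])
```

Unlike ordinary alpha, this estimate requires internal consistency only within each item set, so it does not penalize the scale when positive and negative items behave as related but distinct measures.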
At Time 1, 1,517 students with complete MSLES data were included in the study; 1,075 students had complete data at Time 2.
In comparisons of the 55 item mean scores from Time 1 data, we found that 66% of positive-item means (range 2.53–4.55) were greater than the largest mean of the negative items. Similarly, 26% of negative items’ means (range 3.07–4.55) fell below the lowest positive-item mean. The aggregate picture was of two different distributions.
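The overlap comparison described here can be sketched as below. The function names and the item means are hypothetical and chosen only to illustrate the calculation.

```python
def share_above(values, threshold):
    """Proportion of values strictly greater than the threshold."""
    return sum(value > threshold for value in values) / len(values)


def share_below(values, threshold):
    """Proportion of values strictly less than the threshold."""
    return sum(value < threshold for value in values) / len(values)


# Hypothetical item means (negative items already reverse-scored).
positive_means = [3.2, 3.9, 4.1, 4.4]
negative_means = [2.6, 3.0, 3.5, 4.0]

# Share of positive-item means exceeding the largest negative-item mean.
pos_above_neg_max = share_above(positive_means, max(negative_means))

# Share of negative-item means falling below the smallest positive-item mean.
neg_below_pos_min = share_below(negative_means, min(positive_means))
```

If the two item types measured the same construct on the same continuum, both proportions would be expected to be small, because the two sets of means would largely overlap.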
Means for positive-item sets were greater than means for the corresponding negative sets for all MSLES scales at Time 1 and for five of seven scales at Time 2 (Table 1). Mean scale scores for the seven original scales decreased from Time 1 to Time 2, an expected result reflecting the impact of first-year realities on students. Positive and negative-item sets within scales, however, did not capture that change equally. Within each scale, the change in mean scores from Time 1 to Time 2 was greater for the positive-item set than for the negative set; changes in mean negative-item scores were markedly smaller than changes in mean positive-item scores for four scales.
For all Time 1 scale scores and for five Time 2 scale scores, alpha values for the all-positive-item or all-negative-item sets were equal to or greater than the corresponding original scale's alpha values (Table 2). The original scales’ alpha values ranged from .64 to .80. In comparison, length-adjusted predicted alpha values ranged from .64 to .93 for the positive sets and from .60 to .92 for the negative sets. Under the assumption that positive and negative items measured related but different constructs, stratified alpha values for scales at Time 1 and Time 2 were larger than corresponding original-scale alpha values in eight of 14 instances. The increases in alpha values were generally greatest when the Spearman-Brown adjustment for number of items was larger. In general, alpha values were larger when scales were composed of only positive or only negative items or were adjusted for the assumption that positive and negative items measure different constructs.
Simple examination of mean item responses indicated that positive and negative items elicited different response patterns from students. Within scales, positive-item sets treated as a scale tended to yield higher mean scores than similar negative-item scales, suggesting that the presence of negative items depressed the scale's mean score. The negative items also appeared to suppress the measure of change from Time 1 to Time 2, as scales’ negative-item sets showed less change than corresponding positive-item sets. This suggests that positive and negative items measure somewhat different constructs, raising questions about how well the intended attitude change is measured by scales incorporating negative items.
Though based on statistical projections, this observational study's results are congruent with earlier experimental findings7,8 that mixing positively worded and negatively worded items on a scale suppresses score reliability when compared to the reliability of the same scale composed of all positively phrased items. More rigorous experimental investigations of this phenomenon in medical-education measures might use the methods of Schriesheim and Hill7 or Pilotte and Gable,8 using three versions of the same measure (all positive items, all negative items, and a mixed format of positive and negative items). More theoretically oriented research might seek to determine reasons for the observed differences: do negative and positive items tap different versions of the intended construct, or do respondents use negatively oriented response scales differently than positively oriented scales?
Though we conclude that the MSLES scale scores might be more reliable had the scales been composed solely of positively phrased items, we do not believe that the scores from the existing instrument are necessarily invalidated. Scale reliability estimates, though perhaps less than they might have been, are adequate. In future research, we will, however, be sensitive to potential suppression of attitude change estimates.
Based on the research literature and this investigation, we recommend that attitude-survey developers carefully estimate the potential for undesirable response sets, the theoretical basis for using negatively phrased items,1,2 before acting on the “common wisdom” to use them as a counterstrategy. Are acquiescence and social desirability phenomena so likely in the study population that the consequences of adopting this strategy should be risked? We agree with Barnette's9(p364) position that “it is probably best that all items be positively or directly worded and not mixed with negatively worded items.” Developers of new instruments should consider carefully before applying the common wisdom.
1. Nunnally JC. Psychometric Theory. 2nd ed. New York: McGraw-Hill, 1978.
2. Anastasi A. Psychological Testing. 5th ed. New York: Macmillan, 1982.
3. Crocker LM, Algina J. Introduction to Classical and Modern Test Theory. Fort Worth: Harcourt Brace Jovanovich, 1986.
4. Anderson AB, Basilevsky A, Hum DPJ. Measurement: theory and techniques. In: Rossi PH, Wright JD, Anderson AB (eds). Handbook of Survey Research. New York: Academic Press, 1983.
5. Hojat M, Gonnella JS, Nasca TJ, Mangione S, Veloski JJ, Magee M. The Jefferson Scale of Physician Empathy: further psychometric data and differences by gender and specialty at item level. Acad Med. 2002;77(10 suppl):S58–60.
6. Taylor JC, Bowers DG. Survey of organizations: a machine-scored, standardized questionnaire instrument. Ann Arbor: Institute for Social Research, 1972.
7. Schriesheim CA, Hill KD. Controlling acquiescence response bias by item reversals: the effect on questionnaire validity. Educ Psychol Meas. 1981;41:1101–14.
8. Pilotte WJ, Gable RK. The impact of positive and negative item stems on the validity of a computer anxiety scale. Educ Psychol Meas. 1990;50:603–10.
9. Barnette JJ. Effects of stem and Likert response option reversals on survey internal consistency: if you feel the need, there is a better alternative to using those negatively worded stems. Educ Psychol Meas. 2000;60:361–70.
10. Schriesheim CA, Eisenbach RJ, Hill KD. The effect of negation and polar opposite item reversals on questionnaire reliability and validity: an experimental investigation. Educ Psychol Meas. 1991;51:67–78.
11. Schmitt N, Stults D. Factors defined by negatively keyed items: the result of careless respondents? Appl Psychol Meas. 1985;9:367–73.
12. Marsh HW. Positive and negative global self-esteem: a substantively meaningful distinction or artifactors? J Pers Soc Psychol. 1996;70:810–9.
13. Marshall RE. Measuring the medical school environment. J Med Educ. 1978;53:98–104.
14. Lieberman SA, Stroup-Benham CA, Peel JL, Camp MG. Medical student perception of the academic environment: a prospective comparison of traditional and problem-based curricula. Acad Med. 1997;72(10 suppl):S13–5.
15. Thorndike RL. Applied Psychometrics. Boston: Houghton Mifflin Company, 1982.
16. Feldt LS, Brennan RL. Reliability. In: Linn RL (ed). Educational Measurement. 3rd ed. Phoenix: American Council on Education and Oryx Press, 1993.