Interpretation of Changes in Health-related Quality of Life: The Remarkable Universality of Half a Standard Deviation : Medical Care

Point/Counterpoint

Interpretation of Changes in Health-related Quality of Life

The Remarkable Universality of Half a Standard Deviation

Norman, Geoffrey R. PhD*; Sloan, Jeff A. PhD; Wyrwich, Kathleen W. PhD

Medical Care 41(5):p 582-592, May 2003. | DOI: 10.1097/01.MLR.0000062554.74615.4C

Abstract

Background.  

A number of studies have computed the minimally important difference (MID) for health-related quality of life instruments.

Objective.  

To determine whether there is consistency in the magnitude of MID estimates from different instruments.

Methods.  

We conducted a systematic review of the literature to identify studies that computed an MID and contained sufficient information to compute an effect size (ES). Thirty-eight studies fulfilled the criteria, resulting in 62 ESs.

Results.  

For all but 6 studies, the MID estimates were close to one half a SD (mean = 0.495, SD = 0.155). There was no consistent relationship with factors such as disease-specific or generic instrument or the number of response options. Negative changes were not associated with larger ESs. Population-based estimation procedures and brief follow-up were associated with smaller ESs, and acute conditions with larger ESs. An explanation for this consistency is that research in psychology has shown that the limit of people’s ability to discriminate over a wide range of tasks is approximately 1 part in 7, which is very close to half a SD.

Conclusion.  

In most circumstances, the threshold of discrimination for changes in health-related quality of life for chronic diseases appears to be approximately half a SD.

The interpretation of changes in health-related quality of life (HRQL) has been a research focus for more than a decade. 1 More recently, researchers have been devising methods to identify a minimal level of change consistent with real, as opposed to statistically significant, benefit. 2

The determination of the minimal level of real change for any HRQL scale can be a daunting task. It may potentially vary for different questionnaires, different diseases, and different demographic groups. Potential influences related to the questionnaire itself include the relative position of the individual on the HRQL scale (ie, floor and ceiling effects 3), the number of steps on the scale, the number of items, and so forth. The intent of the analysis (making a diagnosis vs. testing the efficacy of an intervention) and the identity of the individual performing the assessment (patient vs. clinical staff) could potentially result in a different estimate of important change. Collectively, if all these variables must be considered for each measure, this variation represents a prohibitive impediment to the successful implementation of HRQL endpoints in clinical research and practice.

However, there is some evidence that some of these various factors may have a relatively small impact on the magnitude of the minimal difference. Certainly, some authors have noted that, over a series of studies with a diversity of conditions and age groups using disease-specific measures with 7-point response scales, the minimally important difference (MID) appears to fall consistently close to 0.5 points on the 7-point scale. 1,4,5 It is the thesis of the present article that there is more commonality than difference in the variety of approaches. We will show that a multiplicity of methods, using several different scales from time tradeoff to visual analogue, with the number of items ranging from 1 to more than 50, in a diversity of chronic conditions, led to remarkably similar estimates. We will argue that this convergence is not accidental, but is a direct consequence of the limit of human discrimination ability. Finally, we will point out some circumstances that diverge from this consistency.

Approaches to Estimating the Minimally Important Difference

Perhaps the earliest criterion for identifying important change was devised by Cohen, 6 who expressed differences as an effect size—the average change divided by the baseline SD. He stated that in the context of comparing group averages, a small effect size was 0.2, a medium was 0.5, and a large effect size was 0.8. His primary intent was to provide some basis for sample size calculations. However, although Cohen 6 did indicate the criteria “as convention,” they have frequently been referred to in health sciences literature to decide whether a change is important or unimportant, including the assertion that a moderate effect size of half a SD was typically important.
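The effect size described above can be sketched in a few lines. This is a minimal illustration only: the function name `effect_size` and the patient scores are hypothetical, and the sketch assumes paired baseline and follow-up scores for the same patients.

```python
from statistics import mean, stdev

def effect_size(baseline, follow_up):
    """Average change divided by the SD of the baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    # The denominator is the baseline SD, not the SD of the change scores.
    return mean(changes) / stdev(baseline)

# Hypothetical scores for five patients on an arbitrary HRQL scale.
baseline = [40, 50, 60, 70, 80]
follow_up = [48, 57, 69, 77, 89]

print(round(effect_size(baseline, follow_up), 2))
```

With these made-up scores the average change is 8 points against a baseline SD of about 15.8, which lands close to the medium effect size of 0.5.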

Some authors have disputed the definition by Cohen 6 of a moderate effect size as arbitrary, although without substantive arguments for other values. Testa 7 suggested that a threshold medium effect size for individual change should be set at 0.6 rather than 0.5, but this definition also appears arbitrary. Feinstein 8 suggested an alternative value of 0.56, which resulted from detailed mathematical derivations related to the correlation coefficient. Still, although there may be a case for these different values, they remain fairly close to a 0.5 effect size. Conversely, Sloan et al 9 argued that an effect size of 0.5 corresponds to roughly the same value as the 0.5/7 shown with anchor-based methods 1 by assuming that, if the entire range of any scale is considered to span 6 standard deviations, then the 0.5 effect size by Cohen 6 would equate to approximately 8%, or almost exactly 0.5 on a 7-point scale.
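The arithmetic behind the argument by Sloan et al can be written out as a short sketch. The function names are hypothetical, and the sketch assumes that a 7-point scale runs from 1 to 7 (a range of 6 points) and that this range spans 6 standard deviations, as the argument supposes.

```python
def half_sd_as_fraction_of_range(sds_spanned=6.0, effect_size=0.5):
    """Fraction of the full scale range covered by the given effect size."""
    return effect_size / sds_spanned

def half_sd_in_scale_points(scale_min, scale_max, sds_spanned=6.0, effect_size=0.5):
    """Effect size converted to raw points on a scale of the given range."""
    scale_range = scale_max - scale_min
    sd = scale_range / sds_spanned  # assumption: the range spans `sds_spanned` SDs
    return effect_size * sd

print(round(100 * half_sd_as_fraction_of_range(), 1))  # percent of the range
print(half_sd_in_scale_points(1, 7))                   # points on a 1-to-7 scale
```

Under these assumptions, half a SD is about 8% of the range, or 0.5 points on a scale running from 1 to 7.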

By contrast, anchor-based methods explicitly examine the relationship between an HRQL measure and an independent criterion (or anchor) to elucidate the meaning of a particular degree of change. The most popular anchor-based approach uses an estimate of the MID, defined as “the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side-effects and excessive cost, a change in the patient’s management.”1 Typically, methods used to assess the MID from patients are based on a retrospective judgment about whether they have improved, stayed the same, or worsened over some period of time. The methods then establish a threshold based on the change in HRQL in patients who report minimal change, either for better or for worse.
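One simple way to picture the thresholding step is to average the size of the HRQL change among patients whose retrospective judgment indicates minimal change. The sketch below is an illustration under stated assumptions, not the procedure of any particular study: the rating labels, the data, and the function name `anchor_based_threshold` are all hypothetical.

```python
from statistics import mean

# Hypothetical records: (retrospective global rating, change in HRQL score).
records = [
    ("unchanged", 0.1),
    ("slightly better", 0.5),
    ("slightly better", 0.6),
    ("slightly worse", -0.4),
    ("much better", 1.4),
    ("slightly worse", -0.5),
]

# Ratings treated as "minimal change", either for better or for worse.
MINIMAL = {"slightly better", "slightly worse"}

def anchor_based_threshold(records):
    """Mean absolute HRQL change among patients reporting minimal change."""
    minimal_changes = [abs(change) for rating, change in records if rating in MINIMAL]
    return mean(minimal_changes)

print(round(anchor_based_threshold(records), 2))
```

Taking absolute values pools improvement and worsening, which presumes the two are of similar size; they could equally be summarized separately.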

An important consideration, in view of what follows, is one of terminology. Nowhere in the operationalization of the MID approach is there a consideration of importance, or of the tradeoff between benefit and side effects or costs, as appears in the frequently cited definition above. Thus, the criterion may be more appropriately thought of as a minimally detectable difference (MDD), not a minimally important difference. 10,11 As such, it can be viewed as a threshold of detection, analogous to the just noticeable difference of psychophysics.

Another class of anchor-based method involves longitudinal follow-up to determine whether subgroups can be identified that have clinically different outcomes, such as rehospitalization, relapse of cancer, Medical Research Council grading, or different interventions. Although these approaches clearly yield clinically important differences (CID), it is not at all clear that they are, in any sense, minimal.

A final approach, population-based, popularized by Stewart et al, 12 identifies subpopulations with minimally different levels of health (for example, on hypertensive therapy vs. normal, 50 vs. 45 years old) and then looks at the differences in scores on a generic HRQL measure. Although these differences have external significance in terms of population differences, the link to clinical significance, or to any estimate of minimal difference at the individual level, is unclear.

To create a uniform nomenclature in the text to follow, we refer to a difference derived from an individual anchor-based method as minimal difference (MD), with the two subclasses defined above as MDD and CID. Consideration of population difference approaches will emerge in the discussion.

Distribution and anchor-based methods appear to be conceptually very different. 13 Typically it is argued that effect sizes are based entirely on statistical criteria and are thus dependent on the SD, which may conceivably change from one sample to another. By contrast, most anchor-based approaches are based on an external criterion like retrospective judgment of change, and hence are presumed to be sample-independent.

Despite these apparent differences between methods, it is not clear that they will lead to radically different results. Indeed, there is evidence that some apparently diverse methods yield similar findings. Wyrwich 14–16 showed that the standard error of measurement (SEM, $\sigma\sqrt{1 - R}$, where $\sigma$ is the SD and $R$ the reliability) from psychometric theory yielded results approximately the same as the 0.5 on a 7-point scale of the MID methods, which amounts to a convergence between distribution and anchor-based methods. In turn, other common change criteria such as the Jacobson reliable change index 17 and the standardized response mean 18 are related to the SEM, differing only in a numerical multiplier.
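The relation between the SEM and a criterion such as the reliable change index can be sketched as follows. The function names are hypothetical. The index is written here in its commonly used form, the change divided by the square root of 2 times the SEM, which is an assumption about the exact variant intended.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD times sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def reliable_change_index(change, sd, reliability):
    """Change divided by the standard error of a difference, sqrt(2) * SEM."""
    return change / (math.sqrt(2.0) * sem(sd, reliability))

# Hypothetical instrument with SD = 10 points and reliability = 0.84.
print(round(sem(10.0, 0.84), 2))
print(round(reliable_change_index(8.0, 10.0, 0.84), 2))
```

For a fixed SD and reliability, the index is simply the change expressed in SEM units and rescaled by a constant, which is the sense in which such criteria differ from the SEM only by a multiplier.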

These informal observations suggest that the MD, whether defined distributionally or from external anchors, often emerges as approximately equal to 1/2 of a SD. If this were shown to be true based on a more systematic review, it could result in a substantial change in the current research focus. Instead of cataloging the minimal difference for different measures and populations, research could more profitably be directed at determining what factors related to questionnaire design, intended use, and clinical or population characteristics may lead to a departure from a default value of 0.5 SD. It is not inconceivable that an MD for a specific situation might be created a priori by taking into account the general characteristics of the situation, rather than having to conduct a separate study for every HRQL instrument in every setting.

Research Question

We conducted a systematic review of the literature to locate studies that had attempted to determine a threshold using one of the anchor-based methods and that had, in addition, provided sufficient information to permit a calculation of an effect size, to examine the degree of similarity or difference among methods. We explored 4 questions with this data set:

  1. What is the degree of consistency of minimal difference determined by anchor-based methods, as expressed by effect sizes?
  2. Is the effect size influenced by the anchoring criterion? Specifically, do methods directed at determining the MDD, as defined above, yield effect sizes that differ from methods directed at determining the CID?
  3. Is the effect size related to the type of scale used, specifically the number of steps in the response scale?
  4. Is the effect size related to whether the instrument is generic or disease-specific?

Materials and Methods

We systematically reviewed the HRQL literature to identify studies that (1) developed or used MD thresholds to evaluate change over time and (2) included data for the baseline SD of the HRQL measures. We began with a comprehensive computerized literature search of the Medline database of published articles from 1966 to April 2002. The key words quality of life were intersected with the text words meaningful change, relevant change, important difference, meaningful difference, relevant difference, important difference, minimal importance, clinical significance, minimally important, and effect size. The resulting articles were examined for their appropriateness to this review. The initial search yielded a total of 174 articles. Inspection of the titles and abstracts eliminated editorials and review articles, research agenda proposals, and other articles that would not include a threshold for change and the SD of measured subjects at baseline and reduced the number for further consideration to 58. We obtained copies of these 58 articles, and further review left 34 articles that met the criteria above for which effect sizes could potentially be computed. However, on detailed analysis, 5 more articles were eliminated for a number of reasons (incoherent data—for instance, there was no consistency in the amount of change on the HRQL instrument and the global measure of degree of change; incorrect statistics, such as an effect size computed from the SD of change scores; methodological problems—degree of change defined within the HRQL instrument), leaving 29 articles. These articles were then supplemented with a further 9 studies from the authors’ files. These were published articles not identified on the search (6) (usually because the primary intent of the study was not related to establishing an MID), recent publications (1), preprints (1), or conference presentations (1). 
From these 38 studies, we were able to compute a total of 62 effect sizes because many studies contained more than 1 instrument. Six studies were excluded from the main analysis and will be discussed separately.

To address factors which might influence these results, we coded each study on 3 different descriptive variables:

  1. Method of determining the difference: MDD using either the Jaeschke 1 or Redelmeier 19 patient-perceived difference approach (n = 43), or CID based on a clinical outcome event (n = 13)
  2. Seven-point scale (n = 14) or other scaling method (n = 42)
  3. Disease-specific (n = 39) or generic (n = 17) HRQL instrument

Finally, we used the data from a few of these studies (9 studies, 20 effect sizes) to examine whether negative changes are associated with larger MDs, as has been claimed by some authors. 20

All of the contrasts were examined with t-tests. We recognize that in the present data set, multiple measures are often derived from the same study. Under these circumstances, a simple t-test, which treats the data as independent observations, is liberal. However, none of the t-tests was significant; thus, we can be confident in our interpretation.

Results

A summary of the studies 3,5,8,12,14,15,21–54 is shown in Table 1. For each study, we have listed the first author and year of publication, the clinical condition, the specific instruments used, the nature of the response scale, the total sample size (not all of which will have entered into the MDD or CID calculation, since these are usually based on subgroups), the mean age, the proportion male, the follow-up, where these data were available, and the effect size.

Table 1: Distribution of Effect Sizes Computed From Estimated Minimal Differences From 33 Studies

When the instrument in question uses multiple subscales, independence cannot be assumed. Consequently, we computed the mean effect size across all subscales, and also indicated in the last column the range of observed effect sizes. Similarly, if the method involved multiple comparisons (for example, in the study by Bestall et al, 36 comparison between Medical Research Council grades 3 and 4 and comparison between grades 4 and 5), the average was computed and the range listed in the last column.

The most common condition studied was chronic obstructive pulmonary disease, with heart failure, asthma, and various cancers as other commonly studied conditions. Three studies looked at arthritis, 2 at gastrointestinal disease, 1 at carpal tunnel syndrome, and 1 at multiple sclerosis. Reflecting the conditions studied, the most common disease-specific questionnaire was the Chronic Respiratory Questionnaire. Other McMaster-derived questionnaires using similar item structures and 7-point scales are the Asthma Quality of Life Questionnaire (AQLQ), the Chronic Heart Questionnaire (CHQ) for heart failure, the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) for arthritis, the Short Inflammatory Bowel Disease Questionnaire (SIBDQ) for colitis, the Quality of Life in Reflux and Dyspepsia (QOLRAD) for gastrointestinal pain, and the Rhinitis Quality of Life Questionnaire (RQLQ) for rhinitis. A number of studies used the Medical Outcome Study Short Form 36 (SF-36) generic instrument or subscales derived from the SF-36 and the Sickness Impact Profile (SIP). In addition, there were several cancer-specific instruments (FACT-G, LCS, TOI). Response scales were dominated by 7-point and 5-point scales, but a number of other methods (visual analogue scale, time tradeoff, symptom checklists) were also used.

The mean effect size across all measures was 0.495, SD = 0.15. The overall distribution of effect sizes is shown in Figure 1, which displays a reasonably normal distribution. Three studies with effect size greater than 0.90 resulted from the use of a clinical endpoint to subdivide groups and cannot be assumed to represent a minimal value. Thus, the studies we have examined result in estimates of a threshold of important change that are consistently close to approximately one half of the SD (baseline).

Fig. 1: Distribution of effect sizes computed from 56 estimates of the minimal difference derived from 33 studies.

Factors Influencing the Magnitude of the Minimal Effect Size

Further analysis examined some factors within the studies that may influence the magnitude of the minimal difference.

The first factor examined was the method used to obtain the MD, basically in the 2 broad classes of MDD or CID, as discussed earlier. Whereas the CID mean was marginally higher (0.53 vs. 0.47), this difference was not significant (t [54] = 1.12, P = 0.27). Similarly, the standard deviations were smaller among the MDD studies (0.15 for MDD, 0.18 for CID), but this difference was not significant (Levene F = 0.04, P = 0.84).
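For readers who wish to see how such a contrast is formed, the following sketch computes a pooled-variance t statistic from summary values alone. The function name `pooled_t` is hypothetical. The inputs are the rounded means, SDs, and group sizes quoted in the text, so the result should be close to, but need not reproduce, the reported statistic.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / se, df

# CID estimates (n = 13) versus MDD estimates (n = 43), rounded values from the text.
t, df = pooled_t(0.53, 0.18, 13, 0.47, 0.15, 43)
print(round(t, 2), df)
```

As noted in the Methods, this calculation treats the effect sizes as independent observations, which is liberal when several come from the same study.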

We then examined the nature of the scale response to determine whether the uniformity of the limit of discrimination of 1 in 7 was in some way related to the use of a 7-point scale. Comparing 7-point scales to all others, the mean effect sizes were both close to 0.5 (0.53 for 7-point scales, 0.47 for other scales, t [54] = 1.10, P = 0.27). Further, the heterogeneity among the other scaling approaches was not reflected in heterogeneous effect sizes; the standard deviations were very close (0.19 for 7-point scales vs. 0.14 for other scales, Levene F = 0.31, P = 0.58). An analysis of effect sizes for scales with 4, 5, 7, and 10 points (n = 3, 19, 13, and 3, respectively) showed mean values of 0.43, 0.47, 0.52, and 0.50 (F[3,34] = 0.37, P = 0.77). The correlation between number of scale points and effect size was 0.14 (P = 0.40). Thus, there is no evidence that the 0.50 effect size is restricted to 7-point scales.

Finally, we examined generic versus disease-specific scales. Although a number of studies have shown that disease-specific scales are more responsive, this does not necessarily imply that the MD will be lower. The results confirm this; the mean effect size for disease-specific scales was 0.48 versus 0.50 for generic scales (t[54] = 0.49, P = 0.62).

Thus, all the factors we have examined have shown a relatively small impact on the computed effect size. The mean for all 6 subgroups remained within the range 0.49 to 0.56, consistently close to 1/2 a SD, and no significant differences arose. Although the concern may be raised that we had insufficient power to detect a difference, we have computed that, with a sample size of 35 and 15, we had a power of approximately 0.80 to detect a difference of 0.1.

Why Is It Always 0.5?

It seems too remote a coincidence that all these methods converge on approximately 0.5 SD as a threshold value. But what can be the basis for this remarkable consistency? As it turns out, there is some evidence derived from investigations of the psychology of discrimination that may explain the phenomenon.

In 1956, Miller 55 noted that across a wide range of unidimensional discrimination tasks (saltiness of tastes, points on a line, pitch and loudness of sounds, and so forth), the limit of people’s abilities to make absolute discriminations turned out to be very consistent. People were capable of identifying the category of a particular stimulus (loudness of sounds, saltiness of tastes) accurately until the number of categories reached approximately 7 (with a range from approximately 5–9). This observation led to his classic article “The Magic Number Seven Plus or Minus Two,” in which he argues that this uniformity derives from a fundamental characteristic of human information processing that he calls “channel capacity,” related indirectly to limits on short-term memory.

Perhaps this limit of discrimination also applies here. We must first convert “1 part in 7” to SD units. In the original tasks, the stimuli were sampled from a rectangular distribution with a finite range. It can be shown that for a uniform rectangular distribution 7 units wide, the SD equals 2.16, so 1 part in 7, expressed in SD units, is 1/2.16 or 0.46. Similarly, accounting for Miller’s 55 “plus or minus two,” for a rectangular distribution of 5 levels, the SD is 1.58, and 1 in 5 is an effect size of 0.63; for a distribution 9 units wide, the SD is 2.73, and effect size is 0.36.
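The conversion from "1 part in k" to SD units can be checked with a short sketch. The SDs quoted above are consistent with the sample SD (n - 1 denominator) of k equally spaced category values 1, 2, ..., k; that reading is an assumption made here for illustration, and the function name is hypothetical.

```python
from statistics import stdev

def one_category_in_sd_units(n_categories):
    """Width of one category divided by the SD of the values 1..n_categories."""
    levels = list(range(1, n_categories + 1))
    # Assumption: sample SD (n - 1 denominator) of equally spaced levels.
    return 1.0 / stdev(levels)

for k in (5, 7, 9):
    print(k, round(one_category_in_sd_units(k), 3))
```

Under this assumption the results agree with the figures in the text to within rounding: roughly 0.63 for 5 categories, 0.46 for 7, and between 0.36 and 0.37 for 9.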

Thus, based on Miller’s 55 review, the limit of human discrimination is equivalent to an effect size between 0.36 and 0.63. The effect sizes observed in Table 1 have a range (± 1 SD) from 0.34 to 0.64. Thus, the range of estimates for the MD, expressed in SD units, corresponds almost exactly to the limit of human discrimination identified by Miller 55 more than 40 years ago. Since the measures we examined are based, one way or another, on the notion of a threshold between essentially undetectable and minimally detectable change, it is no coincidence that these disparate methods, conducted on diverse clinical populations with a wide range of instruments and different criteria, arrive at a similar value. Moreover, estimates derived from both minimally detectable and clinically important differences on follow-up yield estimates of the same magnitude.

Is It Always 0.5?

Although the methods we have reviewed demonstrate remarkable uniformity, this finding may not be universal. We excluded 6 studies in total from the summary analysis of Table 1 because they appeared to show abnormally large or small effect sizes and were derived from study populations or methods that were very different from the remaining studies. These studies fell into 3 classes.

In the first class are studies directed at establishing differences based on different populations, a method called population-based by Stewart et al. 12 Brook et al 56 used data from the Rand surveys in the 1970s and 1980s to estimate a minimal difference in quality of life equivalent to 5 years of aging and, alternatively, 17 points on the Holmes and Rahe Life Events scale. They then examined effect sizes for a variety of domains of quality of life. As shown in a study by Testa, 57 these differences were all very close to an effect size of 0.08, much smaller than those observed in the present study. Samsa et al 37 discuss attempts to set a minimal difference on the SF-20 by Stewart et al 12 using the difference in score of hypertensive and normal healthy persons. The difference was an effect size of 0.09 to 0.28. Perhaps these results are not surprising; the criterion, while derived from population differences, has no direct relation to a limit of discrimination at an individual level. Further, the equating of 5 years of aging or 17 points on a stress scale to a minimal difference is arbitrary. Moreover, although hypertension can pose a significant health risk, most hypertensives do not have symptoms that would affect HRQL.

A second class of studies is derived from patients recovering from an acute condition. Stratford et al 52 computed MDs for patients receiving physical therapy for low back pain, and found that the minimal difference was a change of 5 points on the Roland Morris Back Pain Questionnaire, which, based on their baseline SD of 5.8, amounts to an effect size of 0.86. Heald et al 51 studied 36 outpatients with acute shoulder pain, of whom 34 showed “meaningful change.” The average change score on a disease-specific measure, the Shoulder Pain and Disability Index (SPADI), was equivalent to an effect size of 1.38; the SIP total showed an effect size of 0.80. For both studies, this amounts to a somewhat larger magnitude of change than those in the other studies considered. However, the patient populations were very different from those with chronic diseases in Table 1; subjects were healthy people recovering from an episode of back or shoulder pain and referred to a therapist, with every expectation of complete recovery. Their standard of comparison is not their baseline pain at the beginning of treatment, but the eventual and expected absence of pain. Perhaps these circumstances result in higher expectations of what constitutes minimal change.

Finally, in the third class, 1 study looked at a very short follow-up interval. Schwartz et al 53 used measures of fatigue following chemotherapy and examined change over a 2-day period. They found effect sizes ranging from 0.05 to 0.33 (mean 0.12) on a variety of fatigue-specific measures. It is possible that, because of the very short time interval, patients were able to recall more precisely their initial state and hence were more sensitive to small changes.

Is It 0.5 SD Across the Range of Responses?

Stratford et al 52 also examined the relation between the MID and baseline values of HRQL. Using the receiver operating characteristic curve (ROC) approach described above, they found a clear positive relation between degree of disability and the MD. Those with minimal impairment (Roland–Morris Questionnaire (RMQ) scores < 8) had an MID of 1, whereas those in the highest impairment category (RMQ > 16) had an MID of 7. Again, this finding may well reflect expectations of return to health, as well as ceiling effects (where with minimal impairment, there is little room to improve). As such, this would appear to be a very different situation than that encountered with chronic diseases.

Is It 0.5 for Positive and Negative Changes?

Although the original articles by Jaeschke et al 1 and Juniper et al 4 found that the MD for positive and negative changes was approximately equal, providing a rationale for combining the 2, this is not a uniform finding, and others have argued that worsening states should be treated differently from improving states. 20 Several of the studies in the present review provided data for both improving and worsening states, permitting an examination of the question. The overall difference between MDs in the worsening and improving groups was 0.09 (0.45 vs. 0.54, t [18] = 1.11, not significant). Thus, we found no evidence that worsening states had a greater MD than improving states.

How Do We Communicate the 0.5 SD Concept?

As Lydick and Epstein 58 pointed out, expressing minimal changes in terms of statistical quantities is of limited value to clinicians. Thus, however ubiquitous the “half a SD” criterion may turn out to be, it may be of limited interpretive value. One simple option is to convert this back into the values for the original HRQL tool. For example, as we showed earlier, a half SD threshold is about the same as an MID of 0.5 on a 7-point scale. Alternatively, Norman et al 25 showed that there is an approximately linear relationship between effect size and the proportion benefitting, so a threshold of 0.5 amounts to a proportion benefitting of the order of 0.17, and, by inverting, a number needed to treat of approximately 6. Finally, for a test-retest reliability of approximately 0.75, the 0.5 SD threshold is exactly equivalent to 1 SEM, which Wyrwich et al 14–16 previously demonstrated to correspond with the MID. Because the reliability of most HRQL measures exceeds 0.75, the 0.5 SD threshold may be a more stringent criterion than 1 SEM.
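Two of these conversions are simple enough to sketch directly. The function names are hypothetical; the first assumes the usual definition of the SEM as the SD times the square root of 1 minus the reliability, and the second takes the number needed to treat as the reciprocal of the proportion benefitting.

```python
import math

def half_sd_in_sem_units(reliability):
    """How many SEMs correspond to 0.5 SD at a given reliability."""
    return 0.5 / math.sqrt(1.0 - reliability)

def number_needed_to_treat(proportion_benefitting):
    """Reciprocal of the proportion of patients who benefit."""
    return 1.0 / proportion_benefitting

print(half_sd_in_sem_units(0.75))               # 0.5 SD equals 1 SEM here
print(round(half_sd_in_sem_units(0.91), 2))     # more than 1 SEM at higher reliability
print(round(number_needed_to_treat(0.17), 1))
```

At a reliability of 0.75 the two criteria coincide, and as reliability rises the SEM shrinks, so 0.5 SD corresponds to more than 1 SEM.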

Conclusion

The present article has shown that under many circumstances, when patients with a chronic disease are asked to identify minimal change, the estimates fall very close to half a SD. This turns out to be approximately the same as the range of human discrimination identified by Miller 55 in 1956. Thus, while initial evidence showing convergence among different approaches 14 may be viewed as nothing more than a remarkable coincidence, we have shown that it may be more than happenstance. There is good reason to presume that this consistency is a reflection of basic psychological processes and could be viewed as a default value unless other evidence comes to light.

Some authors have expressed appropriate concern when simple solutions to complex problems have been proposed. Hays and Woolley 59 specifically argue against the possibility of deriving a single absolute threshold. They argue that no single statistical decision rule or procedure can take the place of a well reasoned consideration of all aspects of the data by a group of concerned, competent, and experienced persons with a wide range of scientific backgrounds and points of view. However, the half SD criterion is not an arbitrary statistical criterion but is empirically derived, based on psychological theory, and is a reflection of the collective judgment processes of the many hundreds of individuals who have participated in the studies in quality of life and in the psychology laboratory. Further, just as 0.05 has become an accepted convention for statistical significance, one may expect that the same could occur with time for the 1/2 SD criterion, which is based on a sounder psychological and empirical foundation.

Implications for Research

Although the approximate value of 0.5 SD appears to hold in a wide variety of situations, it would be presumptuous to assume that it is universal, and we have identified some, but obviously not all, exceptions. So where does this leave the researcher and the practitioner?

For the researcher, the finding suggests a refocus. Instead of cataloging the MD for every instrument, every disease, every demographic subgroup, and every possible goal, we should be attempting to identify those factors that will result in systematic and substantial departure from the 0.5 SD norm. A first step may well be to decide what constitutes a substantial departure. We have seen from this survey that differences in effect size of the order of 0.15 to 0.25 can easily occur within and between studies, without any apparent explanation, and that this range of estimates fits within the range of human discrimination identified by Miller. 55 However, we have also seen that some more substantial departures can occur, and we have speculated about the reason for these exceptions. But this possibility remains unproven, and it is not clear what variables influence the magnitude of this difference. Conceptual frameworks for looking at the factors that may influence responsiveness and the MD, such as that by Beaton et al, 60 may be useful for research, but these may not capture all important factors. Further, although some definitions of the MD include issues of importance, benefits, side effects, and costs, the precise way in which these factors may influence the threshold remains unknown.

For the practitioner or regulator, it may well be that the value of 0.5 SD can serve as a default value for important patient-perceived change on HRQL measures used with patients with chronic diseases. That is, in the situation in which there is little evidence about how an instrument will perform (which must, for some time, be the norm), interpretation about the clinical importance of a treatment effect could initially be based on whether it does or does not exceed 0.5 SD. It would be inappropriate for this to be viewed as a fixed benchmark, like the α of 0.05 for statistical significance, but it would not be inappropriate to consider this as an approximate rule of thumb in the absence of more specific information.

References

1. Jaeschke R, Singer J, Guyatt G. Measurement of health status: ascertaining the minimally important difference. Control Clin Trials 1989; 10: 407–415.
2. Sloan JA, Cella D, Frost MH, et al. Assessing clinical significance in measuring oncology patient quality of life: introduction to the symposium, content overview, and definition of terms. Mayo Clin Proc 2002; 77: 371–383.
3. Stucki G, Daltroy L, Katz JN, et al. Interpretation of change scores in ordinal clinical scales and health status measures: the whole may not equal the sum of the parts. J Clin Epidemiol 1996; 49: 711–717.
4. Juniper EF, Guyatt GH, Willan A, et al. Determining a minimal important change in a disease-specific quality of life questionnaire. J Clin Epidemiol 1994; 47: 81–87.
5. Juniper EF, Guyatt GH, Feeny DH, et al. Measuring quality of life in childhood asthma. Qual Life Res 1996; 5: 35–46.
6. Cohen J. Statistical Power Analysis for the Behavioural Sciences. London: Academic Press; 1969.
7. Testa M. Interpreting quality of life clinical trial data for use in the clinical practice of antihypertensive therapy. J Hypertens 1987; 5: S9–S13.
8. Feinstein AR. Indexes of contrast and quantitative significance for comparisons of two groups. Stat Med 1999; 18: 2557–2581.
9. Sloan JA, Loprinzi CL, Kuross SA, et al. Randomized comparison of four tools measuring overall quality of life in patients with advanced cancer. J Clin Oncol 1998; 16: 3662–3673.
10. Jones P. Interpreting thresholds for a clinically significant change in health status (quality of life) with treatment for asthma and COPD. Eur Resp J 2002; 19: 398–404.
11. Wright JG. The minimally important difference: who’s to say what is important? J Clin Epidemiol 1996; 49: 1221–1222.
12. Stewart AL, Greenfield S, Hays RD, et al. Functional status and well-being of patients with chronic conditions: results from the medical outcomes study. JAMA 1989; 262: 907–913.
13. Guyatt GH, Juniper ED, Walter SD, et al. Measuring health-related quality of life. Ann Int Med 1993; 118: 622–629.
14. Wyrwich KW, Nienaber NA, Tierney WM, et al. Linking clinical relevance and statistical significance in evaluating intra-individual changes in health-related quality of life. Med Care 1999; 37: 469–478.
15. Wyrwich KW, Tierney WM, Wolinsky FD. Further evidence supporting an SEM-based criterion for identifying meaningful intra-individual changes in health-related quality of life. J Clin Epidemiol 1999; 52: 861–873.
16. Wyrwich KW, Tierney WM, Wolinsky FD. Using the standard error of measurement to identify important changes on the Asthma Quality of Life Questionnaire. Qual Life Res 2002; 11: 1–7.
17. Jacobson NS, Truax P. Clinical significance: a statistical approach to defining meaningful change in psychotherapy research. J Consult Clin Psychol 1991; 59: 12–19.
18. McHorney C, Tarlov A. Individual-patient monitoring in clinical practice: are available health status measures adequate? Qual Life Res 1995; 4: 293–307.
19. Redelmeier DA, Guyatt GH, Goldstein RS. Assessing the minimal important difference in symptoms: a comparison of two techniques. J Clin Epidemiol 1996; 49: 1215–1219.
20. Cella DF, Hahn EA, Dineen K. Meaningful change in cancer-specific quality of life scores: differences between improving and worsening. Qual Life Res 2002; 11: 207–221.
21. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for changes in health status. Med Care 1989; 27: S178–S189.
22. Fitzpatrick R, Ziebland S, Jenkinson C, et al. Importance of sensitivity to change as a criterion for selecting health status measures. Qual Health Care 1992; 1: 89–93.
23. Wells GA, Tugwell P, Kraag GR, et al. Minimum important difference between patients with rheumatoid arthritis: the patient’s perspective. J Rheumatol 1993; 20: 557–566.
24. Jones PW, Bosh TK. Quality of life changes in COPD patients treated with salmeterol. Am J Resp Crit Care Med 1997; 155: 1283–1289.
25. Norman GR, Guyatt GH, Gwadry Sridhar F, et al. Comparison of distribution based and anchor-based approaches to interpretation of changes in health related quality of life. Med Care 2001; 39: 1039–1047.
26. Wijkstra PJ, Ten Vergert EM, van Altena R, et al. Long term benefits of rehabilitation at home on quality of life and exercise tolerance in patients with chronic obstructive pulmonary disease. Thorax 1995; 50: 824–828.
27. Wijkstra PJ, Van Altena R, Kraan J, et al. Quality of life in patients with chronic obstructive pulmonary disease improves after rehabilitation at home. Eur Respir J 1994; 7: 269–273.
28. Cella DF, Bonomi AE, Lloyd SR, et al. Reliability and validity of the Functional Assessment of Cancer Therapy-Lung (FACT-L) quality of life instrument. Lung Cancer 1995; 12: 199–220.
29. Goldstein RS, Gort EH, Guyatt GH, et al. Prospective randomized controlled trial of respiratory rehabilitation. Lancet 1994; 344: 1394–1397.
30. Osman LM, Godden DJ, Friend JAR, et al. Quality of life and hospital re-admission in patients with chronic obstructive pulmonary disease. Thorax 1997; 52: 67–71.
31. Redelmeier DA, Bayoumi AM, Goldstein RS, et al. Interpreting small differences in functional status: the six minute walk test in chronic lung disease patients. Am J Respir Crit Care Med 1997; 155: 1278–1282.
32. Bessette L, Sangha O, Kuntz KM, et al. Comparative responsiveness of generic and weighted vs. unweighted health status measures in carpal tunnel syndrome. Med Care 1998; 36: 491–502.
33. Guell R, Casan P, Sangenis M, et al. Quality of life in patients with chronic respiratory disease: the Spanish version of the Chronic Respiratory Questionnaire. Eur Respir J 1998; 11: 55–60.
34. Patrick DL, Martin ML, Bushnell DM, et al. Quality of life in women with urinary incontinence: further development of the incontinence quality of life instrument (I-QOL). Urology 1999; 53: 71–76.
35. Bagenstowe SE, Bernstein JA. Treatment of chronic rhinitis by an allergy specialist improves quality of life outcomes. Ann Allergy Asthma Immunol 1999; 83: 524–528.
36. Bestall JC, Paul EA, Garrod R, et al. Usefulness of the Medical Research Council (MRC) dyspnoea scale as a measure of disability in patients with chronic obstructive pulmonary disease. Thorax 1999; 54: 581–586.
37. Samsa G, Edelman D, Rothman ML, et al. Determining clinically important differences in health status measures: a general approach with illustration to the Health Utilities Index Mark II. Pharmacoeconomics 1999; 15: 141–155.
38. Santanello NC, Zhang J, Seidenberg B, et al. What are minimal important changes for asthma measures in a clinical trial? Eur Respir J 1999; 14: 23–27.
39. Troosters T, Gosselink R, Decramer M. Short and long term effects of outpatient rehabilitation in patients with chronic obstructive pulmonary disease: a randomized trial. Am J Med 2000; 109: 207–212.
40. Kosinski M, Zhao SZ, Dedhiya S, et al. Determining minimally important changes in generic and disease-specific quality of life questionnaires in clinical trials of rheumatoid arthritis. Arthritis Rheum 2000; 43: 1478–1487.
41. Miller DM, Rudick RA, Cutter G, et al. Clinical significance of the multiple sclerosis functional composite: relationship to patient-reported quality of life. Arch Neurol 2000; 57: 1319–1324.
42. Badia X, Podzamczer D, Casado A, et al. Evaluating changes in health status in HIV-infected patients: Medical Outcomes Study-HIV and Multidimensional Quality of Life-HIV quality of life questionnaires. AIDS 2000; 14: 1439–1447.
43. Angst F, Aeschlimann A, Stucki G. Smallest detectable and minimally clinically important differences of rehabilitation intervention with their implications for required sample sizes using WOMAC and SF-36 quality of life measurement instruments in patients with osteoarthritis of the lower extremities. Arthritis Care Res 2001; 45: 384–391.
44. Jowett SL, Seal CJ, Barton R, et al. The short inflammatory bowel disease questionnaire is reliable and responsive to clinically important change in ulcerative colitis. Am J Gastroenterol 2001; 96: 2921–2928.
45. Segal R, Evans W, Johnson D, et al. Structured exercise improves physical functioning in women with stages I and II breast cancer: results of a randomized controlled trial. J Clin Oncol 2001; 19: 657–665.
46. Singh SJ, Sodergren SC, Hyland ME, et al. A comparison of three disease-specific and two generic health-status measures to evaluate the outcome of pulmonary rehabilitation in COPD. Respir Med 2001; 95: 71–77.
47. Talley NJ, Fullerton SF, Junghard O, et al. Quality of life in patients with endoscopy-negative heartburn: reliability and sensitivity of disease-specific instruments. Am J Gastroenterol 2001; 96: 1998–2004.
48. Cella D, Eton DT, Fairclough DL, et al. What is a clinically meaningful change on the Functional Assessment of Cancer Therapy-Lung (FACT-L) questionnaire? results from the Eastern Cooperative Oncology Group (ECOG) study 5592. J Clin Epidemiol 2002; 55: 285–295.
49. Ringash J, Redelmeier DA, O’Sullivan B, et al. Quality of life and utility in irradiated laryngeal cancer patients. Int J Radiat Oncol Biol Phys 2000; 47: 875–881.
50. Angst F, Aeschlimann A, Michel BA, et al. Minimal clinically important rehabilitation effects in patients with osteoarthritis of the lower extremities. J Rheumatol 2002; 29: 131–138.
51. Heald SL, Riddle DL, Lamb RL. The shoulder pain and disability index: the construct validity and responsiveness of a region-specific disability measure. Phys Ther 1997; 77: 1079–1089.
52. Stratford PW, Binkley J, Solomon P, et al. Defining the minimal level of detectable change for the Roland Morris Questionnaire. Phys Ther 1996; 76: 359–365.
53. Schwartz AL, Meek PM, Nail LM, et al. Measurement of fatigue: determining minimally important clinical differences. J Clin Epidemiol 2002; 55: 239–244.
54. Talley NJ, Fullerton S, Junghard O, et al. Quality of life in patients with endoscopy-negative heartburn: reliability and sensitivity of disease-specific instruments. Am J Gastroenterol 2001; 96: 1998–2004.
55. Miller GA. The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychol Rev 1956; 63: 81–97.
56. Brook RH, Ware JE, Rogers WH, et al. Does free care improve adults’ health? N Engl J Med 1983; 309: 1426–1434.
57. Testa MA. Interpretation of quality-of-life outcomes: issues that affect magnitude and meaning. Med Care 2000; 38: II166–II174.
58. Lydick E, Epstein RS. Interpretation of quality of life changes. Qual Life Res 1993; 2: 221–226.
59. Hays RD, Woolley JM. The concept of clinically meaningful difference in health-related quality of life research. Pharmacoeconomics 2000; 18: 419–423.
60. Beaton DE, Bombardier C, Katz JN, et al. Looking for important change/differences in studies of responsiveness. J Rheumatol 2001; 28: 400–405.
Keywords:

Quality of life; threshold; interpretation; MID; effect size

© 2003 Lippincott Williams & Wilkins, Inc.