Table 3 presents the average quality of life score for each domain at baseline and the change over time for SF-39, SF-12, and the SF-12 complement. For the baseline scores, the means for SF-12 and SF-39 were fairly similar within each domain. Although each SF-12 domain had fewer items than the corresponding SF-39 domain or SF-12 complement, the standard deviations in each domain were also in close agreement between the instruments, suggesting that the subset of items comprising SF-12 is representative of the full scale. An examination of boxplots (not shown) revealed that the distributions of SF-39, SF-12 and SF-12 complement domain scores were similar across gender, race, age, CD4 cell count categories and injection drug use. QoL measures over time are considered more important in clinical trials than measures at baseline. The average slopes of change over time and their standard errors were in close agreement between the two instruments. Table 3 also shows that SF-12 has higher proportions of patients who score at either extreme. This is not surprising since SF-12 has fewer possible scores.
Table 4 summarizes the results of correlation analyses. P values for each correlation coefficient are < 0.0001. Pearson product moment correlation between SF-12 and SF-39 was computed for each of the five domains for baseline scores and for the slopes of change for individual participants. These correlation coefficients ranged from 0.77 to 0.94 for baseline scores, and from 0.61 to 0.92 for change over time. Viewing the SF-39 domain scores as criteria, these correlations provide information about criterion-related validity of the SF-12. Since the coefficients of determination (i.e., the squared correlation coefficients) ranged from approximately 0.60 to almost 0.90, we conclude that SF-12 has high validity. The high correlations between SF-12 and SF-39 provide more persuasive evidence about the equivalence of the two instruments than the similarities in the mean and standard errors seen in Table 3, as the high correlations indicate that the two instruments are measuring the same underlying theoretical constructs. These correlation analyses were repeated in subgroups defined by gender, race, age, CD4 cell count categories and injection drug use; the results in each subgroup were similar to those in the whole cohort (results not shown).
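As an arithmetic check on the coefficients of determination quoted above, the following minimal sketch squares the endpoints of the reported baseline range and shows how a Pearson product moment correlation can be computed from paired domain scores. The function names and the five-patient score lists are hypothetical illustrations, not study data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coefficient_of_determination(r):
    """Squared correlation: share of variance in one score explained by the other."""
    return r * r


# Endpoints of the baseline correlation range reported in the text.
low = coefficient_of_determination(0.77)   # roughly 0.59
high = coefficient_of_determination(0.94)  # roughly 0.88

# Hypothetical paired domain scores (NOT study data).
sf39_scores = [55.0, 70.0, 40.0, 85.0, 60.0]
sf12_scores = [50.0, 75.0, 50.0, 75.0, 50.0]
r = pearson_r(sf39_scores, sf12_scores)
```

The squared endpoints match the "approximately 0.60 to almost 0.90" range given for the baseline scores.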
The test–retest reliabilities were assessed using both inter- and intraclass correlation coefficients. Table 5 shows that the intraclass correlations were above 0.60, except for the pain domain of SF-12, indicating satisfactory stability across occasions. As will be discussed later, these test–retest correlation coefficients may underestimate the true reliability of the domains. Interclass correlation coefficients were also computed (results not shown) and were almost identical to the intraclass correlation coefficients. The test–retest correlations were also assessed for subgroups defined by gender, race, and CD4 cell count; the results (not shown) were similar to those obtained for the whole cohort.
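The text does not state which form of the intraclass correlation was used. As an illustration only, the sketch below assumes a one-way random-effects intraclass correlation for two occasions (test and retest); the function name and the four-patient scores are hypothetical.

```python
def icc_one_way(test, retest):
    """One-way random-effects intraclass correlation for two occasions.

    ICC = (MSB - MSW) / (MSB + (k - 1) * MSW), with k = 2 occasions.
    This form is an assumption; the source does not name the variant used.
    """
    k = 2
    n = len(test)
    if n != len(retest) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 patients")
    subject_means = [(a + b) / 2 for a, b in zip(test, retest)]
    grand_mean = sum(subject_means) / n
    # Between-patient mean square.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-patient mean square (disagreement between the two occasions).
    ms_within = sum(
        (a - m) ** 2 + (b - m) ** 2
        for a, b, m in zip(test, retest, subject_means)
    ) / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("scores have no variation; ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical domain scores on two occasions (NOT study data).
baseline = [50.0, 75.0, 25.0, 100.0]
followup = [50.0, 50.0, 50.0, 75.0]
icc = icc_one_way(baseline, followup)
```

When the two occasions agree exactly, the within-patient mean square is zero and the coefficient equals 1; disagreement between occasions pulls it downward.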
The reliabilities of the five domains for SF-39 and SF-12 were assessed by internal consistency and test–retest correlation. Table 6 lists Cronbach's coefficient alpha for each of the domains for which it is applicable. The SF-39 domains had values ranging from 0.85 to 0.92, indicating excellent consistency. Cronbach's coefficient alpha cannot be computed for three SF-12 domains with only one item; for the two domains with more than one item, alpha is 0.68 for the mental health domain, and 0.78 for the physical functioning domain.
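For readers unfamiliar with coefficient alpha, the sketch below computes it from the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of the total score), where k is the number of items. It also makes explicit why alpha is undefined for single-item domains. The function name and the responses are hypothetical, not study data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's coefficient alpha.

    `items` is a list with one inner list per item; each inner list holds
    one response per patient, in the same patient order.
    """
    k = len(items)
    if k < 2:
        # With one item, k / (k - 1) is undefined, so alpha cannot be computed.
        raise ValueError("alpha requires at least two items")
    totals = [sum(responses) for responses in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


# Hypothetical three-item domain answered by five patients (NOT study data).
domain = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(domain)
```

Because the formula divides by k - 1, a one-item domain raises an error here, mirroring the three SF-12 domains for which alpha is not reported.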
Item internal consistency is represented by the corrected item–total correlation, that is, correlation between an item and the sum of all the other items in the same domain. Table 6 shows that SF-39 items have excellent internal consistencies, with all but one item–total correlation being above 0.60. The SF-12 items also have good internal consistencies. Note that the corrected item–total correlation coefficients for SF-12 are just the correlation between two items for the two-item scales.
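The corrected item–total correlation described above can be sketched as follows: each item is correlated with the sum of the remaining items in its domain, so the item does not contribute to its own total. The example also confirms that, for a two-item scale, this reduces to the correlation between the two items. The function names and responses are hypothetical, not study data.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corrected_item_total(items, index):
    """Correlate item `index` with the sum of all OTHER items in the domain."""
    if len(items) < 2:
        raise ValueError("a domain needs at least two items")
    target = items[index]
    others = [item for i, item in enumerate(items) if i != index]
    rest_sum = [sum(values) for values in zip(*others)]
    return pearson_r(target, rest_sum)


# Hypothetical two-item domain, one entry per patient (NOT study data).
two_item_domain = [
    [1, 2, 3, 4, 5],
    [2, 1, 4, 3, 5],
]
r_item0 = corrected_item_total(two_item_domain, 0)
r_pair = pearson_r(two_item_domain[0], two_item_domain[1])
```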
Item discrimination is a concept that indicates an item should have a higher correlation with the scale to which it belongs than with any other scale. For each item, comparison of correlations with target and non-target domains showed that all the domains of SF-12 and SF-39 had 100% scaling success rate, indicating perfect item discrimination.
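A scaling success rate of this kind can be illustrated with a simplified count: for every item, its correlation with its own domain is compared with its correlation with each other domain, and the rate is the percentage of comparisons won by the own domain. This sketch assumes a plain "greater than" comparison with no significance margin, which may differ from the procedure actually applied; all names and correlation values are hypothetical.

```python
def scaling_success_rate(item_correlations):
    """Percent of item-versus-other-domain comparisons won by the own domain.

    `item_correlations` maps an item name to a pair:
    (correlation with its own domain, list of correlations with other domains).
    """
    successes = 0
    comparisons = 0
    for own_r, other_rs in item_correlations.values():
        for other_r in other_rs:
            comparisons += 1
            if own_r > other_r:
                successes += 1
    if comparisons == 0:
        raise ValueError("no comparisons to score")
    return 100.0 * successes / comparisons


# Hypothetical correlations (NOT study data): each item is compared with
# the four domains it does not belong to.
example = {
    "pf_item_1": (0.72, [0.41, 0.38, 0.45, 0.30]),
    "pf_item_2": (0.68, [0.35, 0.40, 0.44, 0.29]),
    "mh_item_1": (0.61, [0.33, 0.52, 0.47, 0.36]),
}
rate = scaling_success_rate(example)
```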
Table 7 presents proportional hazards regression results for the event death. Since the influence of baseline Karnofsky score is essentially constant for the three models, a typical hazard ratio is reported (e.g., for the physical functioning domain, the hazard ratio for Karnofsky score is very close to 0.92 per 10 unit change in each of the three models). Karnofsky score is predictive for death in all domains other than physical functioning. For all domains other than mental health, the SF-39 and SF-12 scores were highly predictive of death (models I and II, respectively). For model III and those same domains, the SF-12 score was predictive and the SF-12 complement score was not, suggesting that the SF-12 complement score adds little information for predicting death when the influence of the SF-12 score is accounted for. For the mental health domain, the SF-39 QoL score was not predictive for death. The SF-12 and SF-12 complement scores in models II and III were predictive for death in the mental health domain. Interestingly, the hazard ratio for the SF-12 complement score in model III was in the opposite direction from that for the SF-12 score. The SF-12 and SF-12 complement scores are positively correlated (r = 0.83) and it is not clear why this anomaly occurs. When the analyses were repeated without adjusting for Karnofsky score, the results were similar except for model III of the pain and mental health domains. For the pain domain, the SF-12 score became significant, with a hazard ratio of 0.73 and a P value of 0.02; for the mental health domain, the complement score became nonsignificant, with a hazard ratio of 1.38 and a P value of 0.12.
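To make the "per 10 unit change" scaling concrete, the sketch below converts a hazard ratio reported for one covariate difference into the hazard ratio for another. It assumes the usual proportional hazards form, in which the log hazard is linear in the covariate, and uses the Karnofsky value of 0.92 per 10 units quoted above; the function names are hypothetical.

```python
import math


def log_hazard_per_unit(hazard_ratio, units):
    """Regression coefficient per single unit, given a hazard ratio per `units`."""
    return math.log(hazard_ratio) / units


def hazard_ratio_for_difference(beta_per_unit, difference):
    """Hazard ratio implied by a covariate difference of `difference` units."""
    return math.exp(beta_per_unit * difference)


# Karnofsky score, physical functioning domain: 0.92 per 10-unit change.
beta = log_hazard_per_unit(0.92, 10)

hr_1 = hazard_ratio_for_difference(beta, 1)    # single-unit difference
hr_20 = hazard_ratio_for_difference(beta, 20)  # equals 0.92 squared
```

Under this assumption, a 20-unit higher Karnofsky score corresponds to a hazard ratio of 0.92 squared, or about 0.85.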
Table 8 displays results for the progression of disease endpoint. Karnofsky score was not predictive of POD in any of the domains. Both the SF-12 and the SF-39 scores were predictive (models I and II) for a POD event in the physical functioning, general health, and energy/fatigue domains; neither was predictive in the pain and mental health domains. With model III, the SF-12 score was predictive only for the physical functioning and mental health domains. These analyses were repeated without adjusting for Karnofsky score and the results were almost identical.
In this population of HIV patients with advanced disease, the data suggest that SF-12 may be an effective substitute for SF-39. For the five domains studied, SF-12 contains much of the information present in the longer SF-39. Use of SF-12 would reduce the burden that collecting health-related QoL information places on patients and clinicians. The results described here are also similar to those of Hurst et al., who compared SF-12 and SF-36 in patients with rheumatoid arthritis.
Our evaluations show there is little overall difference in the baseline mean scores between SF-39 and SF-12. The standard deviations are also similar for SF-39 and SF-12, and there is high correlation among baseline values. The similarity of, and high correlations between, the regression slopes of the SF-12 and SF-39 scores appear to indicate that the SF-12 domain scores are as sensitive to change over time as the SF-39 scores. Since the patient-specific regression slopes are estimated from only a few data points and have large standard errors, the estimated correlation is attenuated towards zero from the true, structural relation, making the high correlations we see even more compelling. Also, the estimated correlations are smaller than those estimated from more diverse, general populations, another manifestation of attenuation.
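The attenuation argument can be illustrated with the classical formula, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The formula assumes measurement errors that are independent of each other and of the true values, which holds only approximately here because the SF-12 items are a subset of the SF-39 items. The function names and reliability values are hypothetical, not estimates from the study.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Observed correlation expected when both measures contain random error."""
    return true_r * math.sqrt(reliability_x * reliability_y)


def disattenuated_correlation(observed_r, reliability_x, reliability_y):
    """Estimate of the underlying correlation, corrected for unreliability."""
    return observed_r / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: slopes estimated from few data points are noisy,
# so their reliabilities are well below 1.
observed = attenuated_correlation(0.90, 0.64, 0.81)
recovered = disattenuated_correlation(observed, 0.64, 0.81)
```

With these illustrative reliabilities, a true correlation of 0.90 would be observed as roughly 0.65, which is why a high observed correlation between noisy slopes points to an even stronger underlying relation.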
The reliabilities of the five domains in SF-39 and two of the domains in SF-12 are good. Using test–retest correlation to assess reliability assumes that a test score consists of a ‘true’ score and measurement error. If there are other sources of variation that are independent of the true score, then test–retest correlation is only a lower bound for reliability and, hence, tends to underestimate reliability.
SF-12 is just as effective as SF-39 in predicting death and POD, and hazard ratios for the SF-12 regressors depend little on whether the SF-12 complement is included as a regressor, indicating that most of the information contained in SF-39 for predicting clinical events is also present in SF-12. Both questionnaires were less effective in predicting POD than death. This reduction in association may be caused, in part, by the fact that the POD endpoint includes a variety of clinical events with a variety of etiologies and associated markers. Many are relatively minor infections, which may not associate with a patient's health-related QoL.
SF-39 does include information not captured by SF-12 (e.g., cognitive functioning, health distress and health transition domains). Hence, use of SF-12 in place of the longer questionnaire(s) is subject to the assumption that cognitive functioning, health distress and health transition domains are not of interest, either as outcomes or as predictors, to the researchers.
SF-12 values were extracted from information gathered via the SF-39. Consequently, relations and conclusions may be different for information gathered by direct administration of SF-12. It is possible that the items in SF-39 but not in SF-12 provide a context which orients patients to QoL issues; hence the SF-12 items administered as part of SF-39 might have provided more information than the information that would have been gathered had SF-12 itself been administered. It is also possible, however, that fewer items may help patients to concentrate better and hence reduce the amount of ‘noise’ in their responses. Furthermore, our study population consisted of patients with late-stage AIDS and our findings may not generalize to other HIV-infected groups. For example, SF-12 does not include the SF-39 questions of bowling and golfing limitations, and these may be pertinent to healthier persons living with HIV/AIDS.
Another source of limitation is the fact that the patient population in the present study consisted predominantly of white homosexual and bisexual male patients. Caution needs to be exercised in generalizing such results to other patient populations. In addition, the trial that yielded the data analyzed in this paper ended in 1995, and hence some conclusions may also be temporally limited.
In addition to using common tools in questionnaire assessment, we have also used proportional hazards regression in our study to investigate the predictive value of QoL with respect to progression of disease and death. Cunningham et al. conducted a study showing that the presence, number and severity of constitutional symptoms in HIV disease, among symptomatic individuals, are strongly related to health-related QoL. However, the power of health-related QoL to predict clinical events was not assessed. Wu et al. cited a study of patients with CD4 cell counts < 100 × 10⁶ cells/l; in this group, a one point difference in baseline MOS-HIV score was related to a 4% increase in hazard for death. Future analyses utilizing this method may include additional endpoints such as adverse events, loss to follow-up and treatment compliance.
Although baseline scores from the SF-12 domains were as effective as those from the SF-39 in predicting clinical events, such relations are not the principal goal of QoL assessment. Rather, QoL assessment is most commonly used to complement and supplement treatment comparisons based on clinical endpoints (and not as a correlate of these outcomes). Therefore, high predictive power is not necessarily an advantage. In fact, Leplege argues for a more existential approach to measuring health-related QoL, shifting attention to methods of assessment more capable of reflecting individuals’ views of their health status rather than a medical interpretation of health status. If an evaluation of QoL as a time-varying covariate shows similarly high predictive power, this would highlight the need to identify a role for QoL assessment in treatment comparisons that is not captured by clinical endpoints.
We thank the patients and clinicians who participated in the CMV study and especially thank Tom Louis, Jim Neaton and others who commented on our work. We thank the anonymous referees whose comments greatly improved this manuscript. The authors are grateful to the following University of Minnesota students for their assistance: Karen A. Clifton, Julie M. Heyd, Yijian Huang and Erika J. Rothe.
1. Testa MA, Simonson DC. Assessment of quality-of-life outcomes. N Engl J Med 1996, 334: 835-840.
2. Heyland DK, Guyatt G, Cook DJ, et al. Frequency and methodologic rigor of quality-of-life assessments in the critical care literature. Crit Care Med 1998, 26: 591-598.
3. Franchi D, Wenzel RP. Measuring health-related quality of life among patients infected with human immunodeficiency virus. Clin Infect Dis 1998, 26: 20-26.
4. Wu AW, Hays RD, Kelly S, Malitz F, Bozzette SA. Applications of the Medical Outcomes Study health-related quality of life measures in HIV/AIDS. Qual Life Res 1997, 6: 531-543.
5. Bozzette SA, Hays RD, Barry SH, Kanouse DE, Wu AW. Derivation and properties of a brief health status assessment instrument for use in HIV disease. J AIDS 1995, 8: 253-265.
6. Hughes TE, Kaplan RM, Coons SJ, Draugalis JR, et al. Construct validities of the Quality of Well-being Scale and the MOS-HIV-34 Health Survey for HIV-infected patients. Med Decis Making 1997, 17: 439-446.
7. Smith KW, Avis NE, Mayer KH, Swislow L. Use of the MQoL-HIV with asymptomatic HIV-positive patients. Qual Life Res 1997, 6: 555-560.
8. Leplege A, Rude N, Ecosse E, Ceinos R, Dohin E, Pouchot J. Measuring quality of life from the point of view of HIV-positive subjects: the HIV-QL31. Qual Life Res 1997, 6: 585-594.
9. Ware JE Jr, Kosinski M, Keller SD. A 12-item short-form health survey: construction of scales and preliminary test of reliability and validity. Med Care 1996, 34: 220-233.
10. Hurst NP, Ruta DA, Kind P. Comparison of the MOS Short Form (SF12) health status questionnaire with the SF36 in patients with rheumatoid arthritis. Br J Rheumatol 1998, 37: 862-869.
11. Brosgart CL, Louis TA, Hillman DW, Craig CP, et al. A randomized, placebo-controlled trial of the safety and efficacy of oral ganciclovir for prophylaxis of cytomegalovirus disease in HIV-infected individuals. AIDS 1998, 12: 269-277.
12. Wu AW, Revicki DA, Jacobson D, Malitz FE. Evidence for reliability, validity and usefulness of the Medical Outcomes Study HIV Health Survey (MOS-HIV). Qual Life Res 1997, 6: 481-493.
13. Ware JE Jr. The SF-36 health survey. In Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd edn. Edited by Spilker B. Philadelphia, PA: Lippincott-Raven, 1996.
14. Hotelling H. The selection of variates for use in prediction with some comments on the general problem of nuisance parameters. Ann Math Stat 1940, 11: 271-283.
15. Gandek B, Ware JE Jr, Aaronson NK, et al. Test of data quality, scaling assumptions, and reliability of the SF-36 in eleven countries: Results from the IQOLA project. J Clin Epidemiol 1998, 51: 1149-1158.
16. CPCRA Management Team/Therapeutics Research Program. CPCRA Data Collection Handbook. Bethesda, MD: Division of AIDS, National Institute of Allergy and Infectious Diseases, National Institutes of Health; 1995, 5
17. Centers for Disease Control and Prevention. 1993 Revised classification system for HIV infection and expanded surveillance case definitions for AIDS among adolescents and adults. MMWR 1992, 41: 1-14.
18. SAS Institute. SAS version 6.10. Cary, NC: SAS Institute; 1996.
19. Cunningham WE, Shapiro MF, Hays RD, et al. Constitutional symptoms and health-related quality of life in patients with symptomatic HIV disease. Am J Med 1998, 104: 129-136.
20. Leplege A, Hunt S. The problem of quality of life in medicine. JAMA 1997, 278: 47-50.
21. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med 1989, 8: 431-440.
Keywords: quality of life; HIV; AIDS; clinical trials; progression of HIV disease
© 2002 Lippincott Williams & Wilkins, Inc.