Assessing the utility of five domains in SF-12 Health Status Questionnaire in an AIDS clinical trial

Han, Cong (a); Pulling, Christopher C. (b); Telke, Susan E. (a); Huppler Hullsiek, Katherine (a); for the Terry Beirn Community Programs for Clinical Research on AIDS

Clinical Science

Objective: To assess a shortened quality-of-life (QoL) measurement tool in a population with advanced HIV infection.

Design: Five domains (seven items) in a 12-item questionnaire (SF-12) were compared with those same domains in a 39-item questionnaire (SF-39). Data were collected using SF-39 in a randomized clinical trial for the prevention of cytomegalovirus disease.

Methods: The performance of SF-12 relative to SF-39 was evaluated within each domain by comparing QoL scores at baseline and over time, assessing the reliability and validity for both instruments, assessing item consistency and discrimination within instruments, and implementing event-time analyses that quantified dependence of the hazard for death and progression of disease (POD) on baseline values.

Results: Baseline measures are similar for both instruments, with high correlation within each domain. The slopes over time for the SF-12 and SF-39 domains are also similar. Both the SF-12 and SF-39 domains have satisfactory reliabilities and perfect discrimination. The hazard ratios for death and POD are similar for both instruments within a domain. All SF-12 and most SF-39 domains are highly predictive for death but are not highly predictive for POD.

Conclusions: For the domains considered, SF-12 is a reasonable and effective replacement for SF-39 in studies of patients with advanced HIV disease. SF-12 reduces item redundancy and the burden of data requirements for both investigators and patients; consequently, it may improve compliance with form completion.

Author Information

From the (a) Division of Biostatistics, School of Public Health, University of Minnesota and (b) Medtronic, Minneapolis, Minnesota, USA.

Requests for reprints to: Katherine Huppler Hullsiek, Division of Biostatistics, University of Minnesota, 2221 University Avenue SE, Suite 200, Minneapolis, MN 55414, USA.

Received: 22 June 2001; revised: 27 September 2001; accepted: 3 October 2001.

Sponsorship: this work was supported by cooperative agreement NIH/NIAID: N01-AI-05073 from the US National Institute of Allergy and Infectious Diseases.

Since the early 1970s, interest in the use of health-related quality-of-life (QoL) measures in clinical research has grown. A variety of instruments have been developed and an increasing number of QoL assessments have been reported in the literature. In fact, the number of articles that list ‘quality of life’ as a reference key word in the Medline database increased 250-fold from 1973 to 1977 [1] and has increased steadily since. QoL assessment was initially performed in studies of chronic illness, but in recent years health-related QoL has become an important consideration in virtually all aspects of health care. A review by Heyland et al. [2] found 64 studies pertaining to adult critical care published between January 1992 and July 1995, in which 108 different QoL instruments were used.

QoL assessments may provide important information in evaluating treatments for chronic diseases when the treatments have associated toxicities or otherwise reduce patient well-being. The increasing number of QoL instruments being developed reflects a lack of consensus for a single instrument that adequately measures QoL with enough sensitivity to detect changes over time while maintaining an acceptable level of patient and investigator burden [3]. A discussion by Wu et al. [4] compared the overlap and differences in nine QoL instruments, which had from 12 to 56 items and were administered to over 20 000 patients with HIV. Those authors propose that selection of a QoL instrument should be based on a balance between sufficient specific questions and pragmatism.

Several QoL measurement tools, usually containing 30–40 items, have been shown to be effective instruments in measuring health-related QoL in HIV-infected individuals [5–8]. However, acceptance of QoL measures in AIDS research has been limited, in part, by concerns over questionnaire length, item redundancy, data requirements and investigator and patient burden in studies where demands on participating patients and clinicians are already high. Therefore, a brief and effective QoL instrument is preferable. Bozzette et al. [5] have argued that most AIDS trials are powered to capture infrequent clinical events and that the consequent large sample sizes would allow better toleration of possible reductions in sensitivity associated with shortening QoL scales.

Items in the 12-item health status questionnaire SF-12 [9] are a subset of those in the MOS SF-36, which addresses physical functioning, general health, pain, mental health, energy/fatigue and social and role functioning. Ware and colleagues have shown that SF-12 is a reliable QoL measure relative to SF-36 in the general population of the United States [9]. SF-12 accounted for more than 90% of the variance in the SF-36 physical and mental component summary measures. Also, it accurately reproduced average scores for both of these SF-36 summary measures. Hurst and colleagues reported similar results for patients with rheumatoid arthritis [10].

The data used in this study are from a randomized clinical trial of oral ganciclovir or placebo for the prevention of cytomegalovirus (CMV) disease in 995 patients with AIDS, sponsored by the Terry Beirn Community Programs for Clinical Research on AIDS (CPCRA 023 trial) [11]. Patients were randomized into the study from November 1992 to December 1993; data collection continued until the trial was concluded in 1995. The CPCRA is a consortium of 16 clinical units in 13 cities in the United States conducting trials at sites where patients with HIV infection receive their primary care. SF-39 was administered to the study population. This questionnaire is essentially identical to the widely used MOS HIV [12], except the physical functioning domain is taken from the quite similar MOS SF-36 [13].

The present study compares seven items from SF-12 with 26 items in SF-39, forming the domains physical functioning, general health, pain, mental health and energy/fatigue. The SF-12 scores were extracted from information gathered via administration of the SF-39.

SF-39 served as the reference standard in the evaluation of the amount of information lost by using the SF-12 subset of items. The five domains studied were those items in SF-12 that formed a subset of items in SF-39, namely physical functioning, general health, pain, mental health and energy/fatigue. The social and role functioning domains in SF-12 were not included in the analysis because they are not included in SF-39. In addition, several domains of SF-39, namely cognitive functioning, health distress, quality of life and health transition, do not have a counterpart in SF-12. For each of the five domains included here, items in SF-39 that do not appear in SF-12 are referred to as the ‘SF-12 complement’ items. There are 26 items in SF-39 that are classified as being in SF-12 or SF-12 complement. Each item considered has from three to six response categories. All items were rescaled to a number between 0 and 100, where a higher score reflects a better QoL.
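The rescaling step can be illustrated with a minimal sketch. The paper does not give its scoring formulas, so the linear mapping below, the 1..k response coding, and the use of an unweighted item mean as the domain score are all assumptions; the function names `rescale_item` and `domain_score` are hypothetical.

```python
def rescale_item(response, n_categories, higher_is_better=True):
    """Map a response coded 1..n_categories linearly onto 0..100.

    Assumption: categories are coded as consecutive integers starting at 1.
    Items whose raw coding runs from best to worst are reversed so that a
    higher rescaled score always reflects a better QoL.
    """
    if n_categories < 2:
        raise ValueError("an item needs at least two response categories")
    if not 1 <= response <= n_categories:
        raise ValueError("response outside the coded range")
    score = 100.0 * (response - 1) / (n_categories - 1)
    return score if higher_is_better else 100.0 - score


def domain_score(item_scores):
    """Assumed domain score: the unweighted mean of the rescaled items."""
    if not item_scores:
        raise ValueError("a domain needs at least one item")
    return sum(item_scores) / len(item_scores)
```

Under this sketch an item with three response categories can only take the values 0, 50 and 100, which is consistent with the later observation that shorter scales have fewer possible scores.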

The study population consisted of 742 patients, or 75% of the 995 patients enrolled in the CPCRA 023 study. The 253 excluded patients had incomplete baseline data on the 26 items of the SF-39. Among the 253 patients excluded, 147 had all QoL items missing, 60 had one item missing and 36 had between two and four items missing. At or before enrollment, all patients had at least one CD4 cell count ≤ 100 × 10⁶ cells/l, a positive CMV serology or positive CMV culture but no history of CMV disease, and were at least 13 years of age. Patients were randomized in a 2:1 ratio to receive oral ganciclovir or placebo. Patients completed the SF-39 questionnaire within 30 days prior to randomization, after 1 month of follow-up and every 4 months thereafter. The average duration of follow-up was 18 months.

To compare SF-12 with SF-39, the distributions of baseline scores were compared by evaluating average scores and standard deviations for each domain. Correlations between baseline SF-12 and SF-39 domain scores were computed, and responsiveness to change, assessed via trends in scores over the course of the clinical trial, was compared. For this, estimated patient-specific slopes from SF-12 were correlated with those from SF-39. Such correlations reflect the criterion-related validity of SF-12 when SF-39 is considered a criterion. Correlations between QoL scores at baseline and at 1 month within an instrument were examined to assess test–retest reliability. The analyses were repeated for subgroups defined by gender (male and female), race (Latino, Black, Caucasian, other), CD4 cell count (≤ 25, 26–50, 51–100 and > 100 × 10⁶ cells/l) and prior intravenous drug use (dichotomous). For domains with multiple items, Cronbach's coefficient alpha was also computed as an index of internal consistency. The proportion of patients scoring at the floor (lowest possible score) and the ceiling (highest possible score) of each domain was also calculated.
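The slope comparison and the floor and ceiling proportions can be sketched as follows. The paper does not state how the patient-specific slopes were estimated; ordinary least squares within each patient is assumed here, and the function names are hypothetical.

```python
from math import sqrt


def ols_slope(times, scores):
    """Least-squares slope of one patient's domain scores against visit time."""
    n = len(times)
    if n < 2 or n != len(scores):
        raise ValueError("need at least two paired observations")
    mean_t = sum(times) / n
    mean_s = sum(scores) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all visits share the same time point")
    sxy = sum((t - mean_t) * (s - mean_s) for t, s in zip(times, scores))
    return sxy / sxx


def pearson(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def proportion_at_extremes(scores, floor=0.0, ceiling=100.0):
    """Return (proportion at floor, proportion at ceiling) for a domain."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    at_floor = sum(1 for s in scores if s == floor) / n
    at_ceiling = sum(1 for s in scores if s == ceiling) / n
    return at_floor, at_ceiling
```

In use, `ols_slope` would be applied once per patient and per instrument (for example at months 0, 1, 5 and 9, following the visit schedule described above), and `pearson` would then be applied to the two resulting lists of slopes.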

Item discrimination was assessed for each item and for each of the domains to which it did not belong (non-target domains) by testing whether or not the item had a higher correlation with its own domain than with the non-target domain. Whenever the former correlation was significantly higher (at the 0.05 level) than the latter, a scaling success was said to have occurred. Note that since there are five domains, each item will have four such comparisons, and a q-item domain will have 4q such comparisons. For any item in a multi-item domain, Hotelling's test for two related correlation coefficients was applied [14]. For the single-item domains, this test is not applicable since the item necessarily correlates perfectly with its domain; in such cases, a one-sided 95% confidence interval was constructed for the correlation coefficient between the item and a non-target domain. Significance was indicated whenever the upper limit of this confidence interval was < 1. The scaling success rate of a domain was taken as the proportion of scaling successes among all comparisons [15].
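A minimal sketch of the scaling-success calculation for a multi-item domain follows. It uses the standard form of Hotelling's t statistic for two correlations that share one variable, with n − 3 degrees of freedom. The P value is taken from a normal approximation to that t distribution, which is an assumption made here to stay within the Python standard library and is only reasonable for large samples such as the 742 patients in this study. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def hotelling_t(r_target, r_other, r_between, n):
    """Hotelling's t for H0: corr(item, target) == corr(item, non-target).

    r_target  : correlation of the item with its own domain
    r_other   : correlation of the item with a non-target domain
    r_between : correlation between the two domain scores
    n         : number of patients
    """
    # Determinant of the 3 x 3 correlation matrix of the three variables.
    det = (1 - r_target ** 2 - r_other ** 2 - r_between ** 2
           + 2 * r_target * r_other * r_between)
    if n <= 3 or det <= 0:
        raise ValueError("statistic undefined for these inputs")
    return (r_target - r_other) * sqrt((n - 3) * (1 + r_between) / (2 * det))


def is_scaling_success(r_target, r_other, r_between, n, alpha=0.05):
    """One-sided test that the target correlation is the higher one."""
    t = hotelling_t(r_target, r_other, r_between, n)
    # Assumption: large-sample normal approximation to the t distribution.
    p_one_sided = 1 - NormalDist().cdf(t)
    return p_one_sided < alpha


def scaling_success_rate(outcomes):
    """Proportion of successes among a domain's 4q comparisons."""
    if not outcomes:
        raise ValueError("no comparisons supplied")
    return sum(1 for ok in outcomes if ok) / len(outcomes)
```

For a two-item domain, `outcomes` would hold 2 × 4 = 8 booleans, one per item and non-target domain.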

Proportional hazards regression models were used to compare the influence of baseline QoL scores on the hazard of death and the hazard of progression of HIV disease (POD) for each domain. POD is defined as the occurrence of any one of 21 AIDS-defining opportunistic infections or of death [16,17]. To compare SF-12 with SF-39, three proportional hazards regression models were evaluated for each domain. Model I used the SF-39 QoL score as a regressor; model II used the SF-12 score, and model III used the SF-12 score and the SF-12 complement score. Comparison of models I and II provided information as to whether SF-12 could predict death and POD as well as SF-39 did; examination of the effects of the complement scores in model III provided information about whether the extra items in SF-39 have any predictive power after controlling for the SF-12 items. All models were stratified by clinical unit and adjusted for randomization group and baseline CD4 cell count (square root scale). Analyses were done both with and without adjustment for baseline Karnofsky score. With adjustment for Karnofsky score, three models (I, II and III) were fitted for each of the five domains and each endpoint, giving 15 models for death and 15 models for POD; the same was true without adjustment. All analyses were conducted using SAS software (SAS Institute, Cary, North Carolina, USA) [18].
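For a single domain, the three models can be written out as below. The notation is introduced here for illustration and is not taken from the paper: s indexes the clinical unit (the stratum), h_{0s}(t) is the stratum-specific baseline hazard, Q_{39}, Q_{12} and Q_{c} are the baseline SF-39, SF-12 and SF-12 complement scores for that domain, and z collects the adjustment covariates (randomization group, square root of baseline CD4 cell count and, where included, baseline Karnofsky score).

```latex
\begin{align*}
\text{Model I:}   \quad & h_s(t \mid Q_{39}, z) = h_{0s}(t)\,
    \exp\bigl(\beta_{39}\,Q_{39} + \gamma^{\top} z\bigr) \\
\text{Model II:}  \quad & h_s(t \mid Q_{12}, z) = h_{0s}(t)\,
    \exp\bigl(\beta_{12}\,Q_{12} + \gamma^{\top} z\bigr) \\
\text{Model III:} \quad & h_s(t \mid Q_{12}, Q_{c}, z) = h_{0s}(t)\,
    \exp\bigl(\beta_{12}\,Q_{12} + \beta_{c}\,Q_{c} + \gamma^{\top} z\bigr)
\end{align*}
```

In this notation the hazard ratio for a one-unit increase in a score is \exp(\beta), and the question addressed by model III is whether \beta_{c} differs from zero once Q_{12} is in the model.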

Table 1 lists the 26 items in SF-39 and classifies each item as being in SF-12 or SF-12 complement. Table 2 presents selected baseline characteristics. The cohort under study consisted primarily of white men between the ages of 30 and 49 years with a history of homosexual or bisexual contact. CD4 cell counts were ≤ 100 × 10⁶ cells/l for 93% of the cohort. Most (67.7%) of the cohort had a Karnofsky score of 80–90.

Table 3 presents the average quality of life score for each domain at baseline and change over time for SF-39, SF-12, and SF-12 complement. For the baseline scores, the means for SF-12 and SF-39 were fairly similar within each domain. Although the SF-12 domain had fewer items than the SF-39 domain or the SF-12 complement, the standard deviations in each domain were also in close agreement between the instruments, suggesting that the subset of items comprising SF-12 is representative of the full scale. An examination of boxplots (not shown) revealed that the distributions of SF-39, SF-12 and SF-12 complement domain scores were similar across gender, race, age, CD4 cell count categories and injection drug use. QoL measures over time are considered more important in clinical trials than measures at baseline. The average slopes and corresponding standard errors of the changes over time were in close agreement between the two instruments. Table 3 also shows that SF-12 has higher proportions of patients who score at either extreme. This is not surprising since SF-12 has fewer possible scores.

Table 4 summarizes the results of correlation analyses. P values for each correlation coefficient are < 0.0001. Pearson product moment correlation between SF-12 and SF-39 was computed for each of the five domains for baseline scores and for the slopes of change for individual participants. These correlation coefficients ranged from 0.77 to 0.94 for baseline scores, and from 0.61 to 0.92 for change over time. Viewing the SF-39 domain scores as criteria, these correlations provide information about criterion-related validity of the SF-12. Since the coefficients of determination (i.e., the squared correlation coefficients) ranged from approximately 0.60 to almost 0.90, we conclude that SF-12 has high validity. The high correlations between SF-12 and SF-39 provide more persuasive evidence about the equivalence of the two instruments than the similarities in the mean and standard errors seen in Table 3, as the high correlations indicate that the two instruments are measuring the same underlying theoretical constructs. These correlation analyses were repeated in subgroups defined by gender, race, age, CD4 cell count categories and injection drug use; the results in each subgroup were similar to those in the whole cohort (results not shown).
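As a check on the stated range, squaring the extreme baseline correlations reported above gives the coefficients of determination directly:

```latex
\begin{align*}
\text{baseline scores:}  \quad & 0.77^2 \approx 0.59, \qquad 0.94^2 \approx 0.88 \\
\text{change over time:} \quad & 0.61^2 \approx 0.37, \qquad 0.92^2 \approx 0.85
\end{align*}
```

The range of approximately 0.60 to almost 0.90 therefore corresponds to the baseline correlations; for the slopes, the lowest correlation of 0.61 corresponds to a coefficient of determination of about 0.37.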

The test–retest reliabilities were assessed using both inter- and intraclass correlation coefficients. Table 5 shows that the intraclass correlations were above 0.60, except for the pain domain of SF-12, indicating satisfactory stability across occasions. As will be discussed later, these test–retest correlation coefficients may underestimate the true reliability of the domains. Interclass correlation coefficients were also computed (results not shown) and were almost identical to the intraclass correlation coefficients. The test–retest correlations were also assessed for subgroups defined by gender, race, and CD4 cell count; the results (not shown) were similar to those obtained for the whole cohort.

The reliabilities of the five domains for SF-39 and SF-12 were assessed by internal consistency and test–retest correlation. Table 6 lists Cronbach's coefficient alpha for each of the domains for which it is applicable. The SF-39 domains had values ranging from 0.85 to 0.92, indicating excellent consistency. Cronbach's coefficient alpha cannot be computed for three SF-12 domains with only one item; for the two domains with more than one item, alpha is 0.68 for the mental health domain, and 0.78 for the physical functioning domain.
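For reference, Cronbach's coefficient alpha for a k-item domain is k/(k − 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below is a minimal implementation of that standard formula; the function name and the data layout (one list of responses per item, patients in the same order in every list) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's coefficient alpha for one domain.

    items: list of k lists, one per item, each holding that item's
    rescaled scores for the same patients in the same order.
    """
    k = len(items)
    if k < 2:
        # With a single item there is no between-item covariance to use.
        raise ValueError("alpha is undefined for a single-item domain")
    totals = [sum(row) for row in zip(*items)]  # per-patient total score
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total score has zero variance")
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / total_var)
```

The explicit error for k = 1 mirrors the point made above: alpha cannot be computed for the three single-item SF-12 domains.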

Item internal consistency is represented by the corrected item–total correlation, that is, correlation between an item and the sum of all the other items in the same domain. Table 6 shows that SF-39 items have excellent internal consistencies, with all but one item–total correlation being above 0.60. The SF-12 items also have good internal consistencies. Note that the corrected item–total correlation coefficients for SF-12 are just the correlation between two items for the two-item scales.
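The corrected item–total correlation defined above can be sketched in a few lines. The function names and the data layout (one list of scores per item) are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)


def corrected_item_total(items, index):
    """Correlate item `index` with the sum of the other items in its domain."""
    if len(items) < 2:
        raise ValueError("need at least two items in the domain")
    # Per-patient sum over every item except the one being evaluated.
    rest = [sum(v for j, v in enumerate(row) if j != index)
            for row in zip(*items)]
    return pearson(items[index], rest)
```

For a two-item scale the "rest" is simply the other item, so the function returns the plain correlation between the two items, as noted above for SF-12.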

Item discrimination is a concept that indicates an item should have a higher correlation with the scale to which it belongs than with any other scale. For each item, comparison of correlations with target and non-target domains showed that all the domains of SF-12 and SF-39 had 100% scaling success rate, indicating perfect item discrimination.

Table 7 presents proportional hazards regression results for the event death. Since the influence of baseline Karnofsky score is essentially constant for the three models, a typical hazard ratio is reported (e.g., for the physical functioning domain, the hazard ratio for Karnofsky score is very close to 0.92 per 10 unit change in each of the three models). Karnofsky score is predictive for death in all domains other than physical functioning. For all domains other than mental health, the SF-39 and SF-12 scores were highly predictive of death (models I and II, respectively). For model III and those same domains, the SF-12 score was predictive and the SF-12 complement score was not, suggesting that the SF-12 complement score adds little information for predicting death when the influence of the SF-12 score is accounted for. For the mental health domain, the SF-39 QoL score was not predictive for death. The SF-12 and SF-12 complement scores in models II and III were predictive for death in the mental health domain. Interestingly, the hazard ratio for the SF-12 complement score in model III was in the opposite direction from that for the SF-12 score. The SF-12 and SF-12 complement scores are positively correlated (r = 0.83) and it is not clear why this anomaly occurs. When the analyses were repeated without adjusting for Karnofsky score, the results were similar except for model III of the pain and mental health domains. For the pain domain, the SF-12 score became significant, with a hazard ratio of 0.73 and a P value of 0.02; for the mental health domain, the complement score became non-significant, with a hazard ratio of 1.38 and a P value of 0.12.

Table 8 displays results for the progression of disease endpoint. Karnofsky score was not predictive of POD in any of the domains. Both the SF-12 and the SF-39 scores were predictive (models I and II) for a POD event in the physical functioning, general health, and energy/fatigue domains; neither was predictive in the pain and mental health domains. With model III, the SF-12 score was predictive only for the physical functioning and mental health domains. These analyses were repeated without adjusting for Karnofsky score and the results were almost identical.

In this population of HIV patients with advanced disease, the data suggest that SF-12 may be an effective substitute for the SF-39. For the five domains studied, SF-12 contains much of the information present in the longer SF-39. Use of SF-12 would reduce the burden to patients and clinicians created by collecting health-related QoL information. The results described here are also similar to those of Hurst et al. [10], who compared SF-12 and SF-36 in patients with rheumatoid arthritis.

Our evaluations show there is little overall difference in the baseline mean scores between SF-39 and SF-12. The standard deviations are also similar for SF-39 and SF-12, and there is high correlation among baseline values. The similarity and high correlations between the regression slopes of the SF-12 and SF-39 scores appear to indicate that the SF-12 domain scores are as sensitive to change over time as the SF-39 scores. Since the patient-specific regression slopes are estimated from only a few data points and have large standard errors, the estimated correlation is attenuated towards zero from the true, structural relation, making the high correlations we see even more compelling. Also, the estimated correlations are smaller than those estimated from more diverse, general populations [9], another manifestation of attenuation.
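The attenuation argument can be made concrete with the classical correction-for-attenuation relation from test theory. It is offered here as an illustrative sketch rather than a calculation from the paper, and it assumes that the estimation errors in the two sets of slopes are independent of each other and of the true slopes. The symbols are introduced for illustration: \rho is the correlation between the true patient-specific slopes, and R_{12} and R_{39} are the reliabilities of the estimated SF-12 and SF-39 slopes (the share of their variance that is not estimation error).

```latex
r_{\text{observed}} \;\approx\; \rho \,\sqrt{R_{12}\, R_{39}},
\qquad 0 < R_{12},\, R_{39} \le 1
\;\;\Longrightarrow\;\;
\lvert r_{\text{observed}} \rvert \le \lvert \rho \rvert
```

Because slopes estimated from only a few visits have large standard errors, R_{12} and R_{39} would be well below 1 under these assumptions, so an observed correlation of a given size implies a larger underlying correlation.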

The reliabilities of the five domains in SF-39 and two of the domains in SF-12 are good. Using test–retest correlation to assess reliability assumes that a test score consists of a ‘true’ score and measurement error. If there are other sources of variation that are independent of the true score, then test–retest correlation is only a lower bound for reliability and, hence, tends to underestimate reliability.

SF-12 is just as effective as SF-39 in predicting death and POD, and hazard ratios for the SF-12 regressors depend little on whether the SF-12 complement is included as a regressor, indicating that most of the information contained in SF-39 for predicting clinical events is also present in SF-12. Both questionnaires were less effective in predicting POD than death. This reduction in association may be caused, in part, by the fact that the POD endpoint includes a variety of clinical events with a variety of etiologies and associated markers. Many are relatively minor infections, which may not associate with a patient's health-related QoL.

SF-39 does include information not captured by SF-12 (e.g., cognitive functioning, health distress and health transition domains). Hence, use of SF-12 in place of the longer questionnaire(s) is subject to the assumption that cognitive functioning, health distress and health transition domains are not of interest, either as outcomes or as predictors, to the researchers.

SF-12 values were extracted from information gathered via the SF-39. Consequently, relations and conclusions may be different for information gathered by direct administration of SF-12. It is possible that the items in SF-39 but not in SF-12 provide a context which orients patients to QoL issues; hence the SF-12 items administered as part of SF-39 might have provided more information than they would have if SF-12 itself had been administered. It is also possible, however, that fewer items may help patients to concentrate better and hence reduce the amount of ‘noise’ in their responses. Furthermore, our study population consisted of patients with late-stage AIDS and our findings may not generalize to other HIV-infected groups. For example, SF-12 does not include the SF-39 questions on limitations in bowling and golfing, and these may be pertinent to healthier persons living with HIV/AIDS.

Another source of limitation is the fact that the patient population in the present study consisted predominantly of white homosexual and bisexual male patients. Caution needs to be exercised in generalizing such results to other patient populations. In addition, the trial that yielded the data analyzed in this paper ended in 1995, and hence some conclusions may also be temporally limited.

In addition to using common tools in questionnaire assessment, we have also used proportional hazards regression in our study to investigate the predictive value of QoL with respect to progression of disease and death. Cunningham et al. [19] conducted a study that showed that the presence, number and severity of constitutional symptoms in HIV disease, among symptomatic individuals, are strongly related to health-related QoL. However, the predictive power of health-related QoL in predicting clinical events was not assessed. Wu et al. [12] cited a study of patients with CD4 cell counts < 100 × 10⁶ cells/l; in this group, a one point difference in baseline MOS-HIV score was related to a 4% increase in hazard for death. Future analyses utilizing this method may include additional endpoints such as adverse events, loss to follow-up and treatment compliance.

Although baseline scores from the SF-12 domains were as effective as those from the SF-39 in predicting clinical events, such relations are not the principal goal of QoL assessment. Rather, QoL assessment is most commonly used to complement and supplement treatment comparisons based on clinical endpoints (and not as a correlate of these outcomes). Therefore, high predictive power is not necessarily an advantage. In fact, Leplege [20] argues for a more existential approach to measuring health-related QoL, shifting attention to methods of assessment more capable of reflecting individuals’ views of their health status rather than a medical interpretation of health status. If an evaluation of QoL as a time-varying covariate [21] showed similarly high predictive power, it would highlight the need to identify a role for QoL assessment in treatment comparisons that is not captured by clinical endpoints.

We thank the patients and clinicians who participated in the CMV study and especially thank Tom Louis, Jim Neaton and others who commented on our work. We thank the anonymous referees whose comments greatly improved this manuscript. The authors are grateful to the following University of Minnesota students for their assistance: Karen A. Clifton, Julie M. Heyd, Yijian Huang and Erika J. Rothe.

1. Testa MA, Simonson DC. Assessment of quality-of-life outcomes. N Engl J Med 1996, 334: 835-840.
2. Heyland DK, Guyatt G, Cook DJ, et al. Frequency and methodologic rigor of quality-of-life assessments in the critical care literature. Crit Care Med 1998, 26: 591-598.
3. Franchi D, Wenzel RP. Measuring health-related quality of life among patients infected with human immunodeficiency virus. Clin Infect Dis 1998, 26: 20-26.
4. Wu AW, Hays RD, Kelly S, Malitz F, Bozzette SA. Applications of the Medical Outcomes Study health-related quality of life measures in HIV/AIDS. Qual Life Res 1997, 6: 531-543.
5. Bozzette SA, Hays RD, Barry SH, Kanouse DE, Wu AW. Derivation and properties of a brief health status assessment instrument for use in HIV disease. J AIDS 1995, 8: 253-265.
6. Hughes TE, Kaplan RM, Coons SJ, Draugalis JR, et al. Construct validities of the Quality of Well-being Scale and the MOS-HIV-34 Health Survey for HIV-infected patients. Med Decis Making 1997, 17: 439-446.
7. Smith KW, Avis NE, Mayer KH, Swislow L. Use of the MQoL-HIV with asymptomatic HIV-positive patients. Qual Life Res 1997, 6: 555-560.
8. Leplege A, Rude N, Ecosse E, Ceinos R, Dohin E, Pouchot J. Measuring quality of life from the point of view of HIV-positive subjects: the HIV-QL31. Qual Life Res 1997, 6: 585-594.
9. Ware JE Jr, Kosinski M, Keller SD. A 12-item short-form health survey: construction of scales and preliminary test of reliability and validity. Med Care 1996, 34: 220-233.
10. Hurst NP, Ruta DA, Kind P. Comparison of the MOS Short Form (SF12) health status questionnaire with the SF36 in patients with rheumatoid arthritis. Br J Rheumatol 1998, 37: 862-869.
11. Brosgart CL, Louis TA, Hillman DW, Craig CP, et al. A randomized, placebo-controlled trial of the safety and efficacy of oral ganciclovir for prophylaxis of cytomegalovirus disease in HIV-infected individuals. AIDS 1998, 12: 269-277.
12. Wu AW, Revicki DA, Jacobson D, Malitz FE. Evidence for reliability, validity and usefulness of the Medical Outcomes Study HIV Health Survey (MOS-HIV). Qual Life Res 1997, 6: 481-493.
13. Ware JE Jr. The SF-36 health survey. In Quality of Life and Pharmacoeconomics in Clinical Trials, 2nd edn. Edited by Spilker B. Philadelphia, PA: Lippincott-Raven, 1996.
14. Hotelling H. The selection of variates for use in prediction with some comments on the general problem of nuisance parameters. Ann Math Stat 1940, 11: 271-283.
15. Gandek B, Ware JE Jr, Aaronson NK, et al. Test of data quality, scaling assumptions, and reliability of the SF-36 in eleven countries: Results from the IQOLA project. J Clin Epidemiol 1998, 51: 1149-1158.
16. CPCRA Management Team/Therapeutics Research Program. CPCRA Data Collection Handbook. Bethesda, MD: Division of AIDS, National Institute of Allergy and Infectious Diseases, National Institutes of Health; 1995, 5: 11.
17. Centers for Disease Control and Prevention. 1993 Revised classification system for HIV infection and expanded surveillance case definitions for AIDS among adolescents and adults. MMWR 1992, 41: 1-14.
18. SAS Institute. SAS version 6.10. Cary, NC: SAS Institute; 1996.
19. Cunningham WE, Shapiro MF, Hays RD, et al. Constitutional symptoms and health-related quality of life in patients with symptomatic HIV disease. Am J Med 1998, 104: 129-136.
20. Leplege A, Hunt S. The problem of quality of life in medicine. JAMA 1997, 278: 47-50.
21. Prentice RL. Surrogate endpoints in clinical trials: definition and operational criteria. Stat Med 1989, 8: 431-440.

Keywords: quality of life; HIV; AIDS; clinical trials; progression of HIV disease

© 2002 Lippincott Williams & Wilkins, Inc.