Measurement of quality of life (QoL) is now incorporated into the design of many randomized clinical trials and other research studies as a standard endpoint. It is a particularly important feature of trials that involve patients with chronic disease which may or may not be curable. Accurate and reliable measurement of QoL is pivotal to the success of any trial which seeks to detect a significant change in QoL as a consequence of disease treatment and/or progression.
Melanoma can have a considerable impact on patients’ lives, including their health-related QoL, which the WHO defines 1 as ‘an individual’s perception of their position in life in the context of the culture and value system in which they live and in relation to their goals, expectations, standards and concerns’. Health-related QoL is a wide-ranging and complex concept that is uniquely influenced by the physical health, psychological state, level of independence, social relationships, personal beliefs and the relationship with other salient features of the personal environment of each individual. Recent advances in the systemic treatment of metastatic melanoma have brought new hope to individuals with this form of skin cancer 2. As new agents move into adjuvant trials 3, a range of new side effects, which are different from those of standard chemotherapy, will be experienced by more patients, many of whom would not have relapsed. It is essential, therefore, to accurately measure the changes in QoL that patients experience, in addition to survival, to comprehensively evaluate the overall benefits of new therapeutic approaches.
The FACT-Melanoma (FACT-M) is a valid and reliable instrument that has been used for the assessment of QoL in melanoma clinical trials, and is one of only two disease-specific instruments designed for use in patients with melanoma. This instrument incorporates the items from the cancer-generic questionnaire, the FACT-G 4, followed by questionnaire items that are specific to melanoma. The FACT-M has been validated for use in patients with stage I–IV melanoma and has been shown to be reliable, responsive and to have good construct and convergent validity 5. Respondents are asked to score statements related to their QoL with melanoma, based on a five-point response scale (not at all, a little bit, somewhat, quite a bit and very much). The second instrument 6, the malignant melanoma module, comprises 11 items with a four-point response format. It was used in a longitudinal QoL study comprising 89 patients, in 1987, in Sweden; however, no further validation studies have been published on the malignant melanoma module since 1993.
Until recently, the structure and the response format of the FACT-M had not been tested in detail using modern item response theory-based statistical models. One such approach is Rasch analysis, named after the Danish mathematician Georg Rasch and originally presented for dichotomous data 7. This family of approaches has become more accessible lately, owing to the development and availability of user-friendly software (e.g. RUMM2030) 8. Rasch analysis has become increasingly accepted across research disciplines as an improved method of developing scales 9–11.
Many research instruments require respondents to score a set of items to indicate increasing levels of a response on some variable; in this case, the patient’s level of QoL. Summing the scores of the items to give a single score for a person implies that the items are intended to measure a single variable, often referred to as a unidimensional variable. Many ordinal scales are treated as if they provide interval-level measurement; however, ordinal scales of measurement do not support the mathematical operations needed to calculate means and SDs 12. The analysis of data according to the Rasch model provides a range of fit statistics to check whether or not adding the scores of a research instrument is justified. If the data fit the model adequately for the purpose, the linearized total score can be used in subsequent parametric analyses more readily than the raw total score.
The aims of this study were to:
- test the original factor structure and response format of the current FACT-M for goodness of fit to the Rasch model, using RUMM2030 software, and
- compare goodness of fit to the Rasch model using a modified FACT-M with a four-point response format.
All necessary relevant permissions were granted by the Ethics Committees of the Royal Prince Alfred Hospital, Sydney, and the University of Sydney. The study was financially supported by the University of Sydney Cancer Research Fund, the Australia and New Zealand Melanoma Trials Group (ANZMTG) and Provectus Pharmaceuticals.
Participant data sets
Two data sets were used with the consent of the chief investigators of the following two studies.
Melanoma Quality of Care study, data collection August 2009 to October 2009
The original version of the FACT-M was completed by 127 respondents with stages I–IV melanoma and scored according to the guidelines in the Functional Assessment of Chronic Illness Therapy (FACIT) manual. A higher score indicates higher QoL, and scores from this sample of patients were severely skewed towards the higher end of the QoL scale. Table 1 shows the scores obtained from the Melanoma Institute Australia (MIA) group, benchmarked against data from two published studies involving melanoma patients 5,13. Using the core instrument only, the FACT-G total score was similar between the MIA patients and the published data by Smith and colleagues; however, inspection of the individual scales revealed the Social Well-being (SWB) subscale was higher by more than 6 points and the Emotional Well-being (EWB) subscale lower by 3 points for the MIA group compared with the Smith and colleagues’ data (Table 1).
Improving Quality of Life measurement for melanoma patients and their families: validity and reliability study of quality-of-life instruments in an Australian sample, data collection July 2010 to November 2010
Authors of the present study, together with colleagues at MIA, ANZMTG and the Psycho-oncology Cooperative Research Group (PoCoG), University of Sydney, were awarded a grant by the University of Sydney to conduct a validation study of the most frequently used instruments to measure QoL in a melanoma population in Australia. As part of this study, a modified version of the FACT-M using a four-point response format was trialed. The purpose of collating these data was simply to test how the four-point response format for each item on the modified version differed from the existing version using a five-point scale, by conducting a series of Rasch analyses. No comparative data for subscales and total score on the FACT-M were calculated using this dataset.
Data sets were merged and initially prepared using IBM SPSS for Windows (version 19) 14, and variables were screened for missing values. Respondents with missing values were omitted from the analyses, so that only the original complete responses were analysed and no imputation methods were used. However, respondents with occasional missing values for demographic characteristics (sex, age group and stage of melanoma) were retained. Preliminary descriptive analysis of responses to the items on the two versions of the FACT-M was conducted, and items were checked for restriction in range; that is, where two responses accounted for more than 95% of responses 15. Such items were nevertheless retained in this analysis unless results from the Rasch analysis indicated that they should be removed.
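The restriction-in-range check described above can be sketched in a few lines. This is an illustration only, not the SPSS procedure used in the study: the function name `is_range_restricted`, the 0 to 4 coding and the example responses are all assumptions made for the sketch.

```python
from collections import Counter


def is_range_restricted(responses, cutoff=0.95):
    """Return True if the two most-used categories account for more than
    `cutoff` of all responses to a single item."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    # Sum the frequencies of the two most common response categories.
    top_two = sum(n for _, n in counts.most_common(2))
    return top_two / len(responses) > cutoff


# Hypothetical responses to one item coded 0-4 (not study data).
example_item = [0] * 90 + [1] * 8 + [2] * 2
print(is_range_restricted(example_item))
```

An item flagged in this way is not necessarily a poor item; as stated above, flagged items were kept unless the Rasch analysis indicated otherwise.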
Differences in the demographic characteristics of the respondents in the two data sets (five-point response and four-point response) were analysed using appropriate nonparametric statistical tests in SPSS. Subscale reliability, before Rasch analysis, was assessed using Cronbach α coefficients 16. Principal components analysis with oblimin rotation was chosen to test the original fit of the two melanoma-specific subscales of the instrument. A fixed format data file was prepared and imported to RUMM2030 software, to test the unidimensionality of the individual subscales using Rasch analysis.
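For readers unfamiliar with the Cronbach α coefficient, a minimal sketch of the standard formula follows. It assumes complete cases and uses sample variances; the function name `cronbach_alpha` and the small data set are hypothetical and do not reproduce the SPSS output of the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    items = list(zip(*rows))  # transpose: one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical two-item subscale answered by three respondents.
print(cronbach_alpha([[1, 2], [2, 1], [3, 3]]))
```

The coefficient depends on the raw ordinal scores, which is one reason the Rasch-based Person Separation Index is reported alongside it later in the analysis.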
Background to Rasch analysis
The essential rule in successful measurement is used universally for money, length, area, weight and temperature; that rule is ‘one more unit means the same amount extra, no matter how much there already is’. This is exactly what Rasch measurement operationalizes for social science 17. Fitting data to the Rasch model addresses several key methodological aspects associated with scale development and construct validation, in addition to providing a linear transformation of the ordinal raw score 18. Analysis of data according to the Rasch model provides a range of fit statistics to determine whether or not adding the scores is justified in the data. If the invariance of responses across different groups of individuals does not hold, then taking the total score to characterize a person is not justified. Data never fit the model perfectly, and it is important to consider the fit of data to the model with respect to the uses to be made of the total scores. If the data fit the model adequately for the purpose, the linearized total score can be used in subsequent parametric analyses more readily than the raw total score, which may exhibit floor and ceiling effects.
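The ‘one more unit means the same amount extra’ rule can be illustrated with the dichotomous Rasch model, in which the log-odds of endorsing an item equal the person location minus the item location, both in logits. The sketch below is illustrative; the function names and the numerical values are assumptions, not estimates from this study.

```python
import math


def rasch_probability(theta, b):
    """Probability of endorsing a dichotomous item under the Rasch model.

    theta: person location (logits); b: item location (logits).
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# A one-logit gain in person location adds one unit of log-odds,
# whether the person starts low or high on the scale.
item_location = 0.5  # hypothetical item
for start in (-3.0, 0.0, 2.0):
    gain = log_odds(rasch_probability(start + 1.0, item_location)) - log_odds(
        rasch_probability(start, item_location)
    )
    print(start, round(gain, 6))
```

The raw probability does not change by a constant amount over the same intervals, which is why the logit scale, rather than the raw score, carries the interval property.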
A series of Rasch analyses were conducted on the six subscales of the FACT-M, using the RUMM2030 software. The default procedure for RUMM uses the partial credit model, which allows items to have varying numbers of response categories and does not assume that the distance between response thresholds is uniform. The following summary statistics were used to assess model fit using guidelines developed by Pallant and Tennant 18.
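To make the partial credit model concrete, the following sketch computes category probabilities for a single polytomous item from its thresholds. It follows the usual textbook form of the model and is not the RUMM2030 implementation; the function name and the threshold values are hypothetical.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model.

    theta: person location (logits).
    thresholds: item thresholds (logits), one per step between adjacent
    categories, so an item with m thresholds has m + 1 categories.
    """
    # Cumulative sums of (theta - threshold); category 0 has an empty sum.
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (theta - tau))
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical four-category item with ordered, unevenly spaced thresholds.
print(pcm_category_probabilities(0.0, [-1.5, 0.2, 1.0]))
```

When thresholds are ordered, each category is the most probable response somewhere along the continuum. With disordered thresholds at least one category is never the most probable, which is the pattern a threshold map is used to detect.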
The initial solution was checked for convergence and model fit, assessed by a range of summary statistics. A well-fitting solution would be indicated by a probability from the item–trait interaction χ2 greater than 0.05 divided by the number of items in the scale (Bonferroni correction) 19. For this analysis, fit residual values, for both persons and items, were inspected; a mean close to zero and an SD less than 1.5 were desirable. Individual item fit residual values greater than +2.5 were considered to indicate misfit and values less than −2.5 to indicate item redundancy. Internal consistency was assessed using the Person Separation Index (PSI), with values above 0.7 considered acceptable. Threshold maps were inspected for significant disordering, which would indicate inconsistent use of the response options.
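The decision rules in this paragraph amount to simple arithmetic, sketched below. The function names are hypothetical, and the two residuals used as examples are values reported in the Results for individual items; the sketch only applies the stated cut-offs and does not compute any fit statistic.

```python
def bonferroni_alpha(n_items, alpha=0.05):
    """Significance level after dividing alpha by the number of items."""
    return alpha / n_items


def classify_fit_residual(residual, limit=2.5):
    """Apply the +/-2.5 rule for individual item fit residuals."""
    if residual > limit:
        return "misfit"
    if residual < -limit:
        return "redundancy"
    return "acceptable"


# Seven-item subscale: 0.05 / 7 rounds to 0.007.
print(round(bonferroni_alpha(7), 3))
print(classify_fit_residual(5.078))
print(classify_fit_residual(-2.561))
```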
Differential item functioning (DIF) was checked to identify possible item bias arising from differences in how groups within the sample responded. This study assessed DIF for three factors: sex of patient (male vs. female), patient age (<50, 50 to <65 and 65+) and stage of disease (local vs. metastatic). Person–item threshold maps were plotted to assess whether the FACT-M appropriately targeted the respondent group. Lastly, dimensionality was assessed using t-tests to compare person estimates derived from the two most disparate subsets of scale items 20. A proportion of significant t-tests of less than 5% was considered acceptable.
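The dimensionality check can be sketched as follows, assuming that each respondent has two person estimates (one from each item subset) with their standard errors, and that a t value beyond ±1.96 is treated as significant. These assumptions, the function name and the example values are illustrative and should not be read as the exact RUMM2030 procedure.

```python
import math


def proportion_significant_t_tests(est_a, est_b, se_a, se_b, critical=1.96):
    """Share of respondents whose two person estimates differ significantly.

    est_a, est_b: person estimates (logits) from two item subsets.
    se_a, se_b: the corresponding standard errors.
    """
    flagged = 0
    for a, b, sa, sb in zip(est_a, est_b, se_a, se_b):
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(t) > critical:
            flagged += 1
    return flagged / len(est_a)


# Hypothetical estimates for four respondents (not study data).
subset_a = [0.0, 1.0, 2.0, -1.0]
subset_b = [0.1, 1.2, -1.0, -0.9]
standard_errors = [0.5, 0.5, 0.5, 0.5]
print(proportion_significant_t_tests(subset_a, subset_b, standard_errors, standard_errors))
```

A proportion below 0.05 would be read as support for unidimensionality under the criterion stated above.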
Both data sets were screened for missing responses to the items of the FACT-M. Data were collected from 127 respondents using the ‘original’ version of the FACT-M (five-point response format) and 123 respondents using the ‘modified’ FACT-M (four-point response format). Occasional missing values were accepted for participant demographic variables, which are shown in Table 2 together with the significance values for differences detected between the two groups of respondents, now referred to as the ‘original’ five-point version and the ‘modified’ four-point version. The respondents of the two groups were of similar age, disease status and American Joint Committee on Cancer (AJCC) staging of first primary. However, the ‘original’ group included 37% men compared with 47% men in the ‘modified’ group (P=0.014).
Item frequencies for all questions on the FACT-G, the Additional Concerns (AC) subscale and Melanoma Surgery (MSS) subscale were inspected for restriction in range. Item frequencies for the two melanoma-specific subscales are shown in Tables 3 and 4. On average, the middle response category ‘somewhat’ was used just less than 7% of the time in the ‘original’ version of the AC subscale and 5% of the time for the MSS.
The results of the Rasch analyses for each of the subscales of the original and the modified version of the FACT-M were as follows:
The original Physical Well-being subscale comprised seven items scored on a five-point scale. The summary fit statistics for the original seven-item scale were all within normal limits; however, in this sample, the PSI was low (0.458, Table 5). The first Rasch analysis of this subscale showed that the responses for six out of seven items showed slight disordering (Fig. 1). On inspection, the category probability curves indicated that this may have been because of patient responses to the middle categories ‘a little bit’ and ‘somewhat’ (Fig. 2).
The threshold map for the items scored on a four-point scale showed reduced disordering, except for one item, ‘I have nausea’. Again, the fit statistics for the seven-item scale were all within normal limits and the PSI improved to 0.605. The percentage of significant equating t-tests was 1.68%, which is below the 5% criterion and supports the unidimensionality of the scale. No DIF was found for any factor (sex, age or disease group). In this sample, the four-point response format slightly improved the goodness of fit to the Rasch model.
The original SWB subscale comprised seven items scored on a five-point scale. The threshold map showed some disordering for all seven items, again most likely to be associated with the middle categories ‘a little bit’ and ‘somewhat’. This result mirrored the analysis of the Physical Well-being subscale and supported the theory that rescoring might improve the goodness of fit. The probability of the goodness-of-fit χ2 at this point was below the recommended level of 0.007 (for a seven-item scale), indicating the fit could be improved. The PSI was also low at 0.507 (Table 5).
The threshold map for the seven items of the SWB subscale scored on a four-point scale corrected all items for disordering. The summary statistics, however, still showed evidence of item misfit with a 2.36 SD. The fit residual for the item ‘I am satisfied with my sex life’ was +5.078 indicating misfit of this item to the SWB subscale. This item was removed and fit statistics for the six-item scale were all within normal limits, with an improved PSI of 0.670 (Table 5).
The original EWB subscale comprised six items scored on a five-point scale. The probability of the goodness-of-fit χ2 at this point was highly significant (P<0.0001), indicating poor fit. The PSI was slightly low at 0.620 (Table 5). The threshold map indicated some disordering, again associated with middle categories ‘a little bit’ and ‘somewhat’.
The threshold map for the six items of the EWB subscale scored on a four-point scale corrected the items for disordering, except for item 6, ‘I worry that my condition will get worse’. The goodness-of-fit χ2 still showed evidence of misfit (P=0.0004). The fit residual for item 2 ‘I am satisfied with how I am coping with my illness’ was +3.052 indicating misfit of this item to the subscale. This item was removed and fit statistics for the five-item scale were all within normal limits, with an improved PSI of 0.699 (Table 5).
The original functional well-being subscale comprised seven items scored on a five-point scale. The summary statistics showed some evidence of item misfit with a slightly high SD of 1.748. The probability of the goodness-of-fit χ2 at this point was 0.0008, indicating poor fit, although the PSI of 0.704 was good (Table 5). The threshold map showed slight disordering for most of the items.
In the first analysis, the summary statistics for the seven items of the functional well-being subscale scored on a four-point scale were very similar to the figures for the five-point scale. This was surprising because the rescoring had corrected all items for disordering. The goodness-of-fit χ2 still showed evidence of misfit (P=0.0009), with an item fit residual of −2.561 for item 3, ‘I am able to enjoy life’. After deletion of item 3, fit statistics for the six-item scale were all within normal limits, with an improved PSI of 0.699 (Table 3). The percentage of significant equating t-tests was 1.57%, which is below the 5% criterion and supports the unidimensionality of the scale. No DIF was found for any factor (sex, age or disease group).
Analysis of Additional Concerns and Melanoma Surgery subscales
Additional Concerns Subscale
Five-point response: The AC subscale comprised 16 items scored on a five-point scale. The threshold map showed some disordering in the category probability curves for 14 out of the 16 items. The summary statistics, however, were all within normal range with an overall χ2 of 37.43 (d.f.=32), indicating no significant misfit. However, as the χ2 statistic is sensitive to sample size and the ratio of items to respondents was low (16 items to 127 participants), a check was performed using the adjusted χ2 option in RUMM2030. If the sample size had been larger (n=160), significant misfit would have been detected.
Four-point response: In the first analysis, the summary statistics showed inadequate fit for the 16-item subscale. The item fit residual χ2 for M6 ‘I have noticed blood in my stool’ was significant so, in the next step, this was deleted. The fit improved but further analysis indicated item M2 ‘I have noticed new changes in my skin (lumps, bumps, colour)’ could also be deleted. The overall fit statistics for the final 14-item scale and the PSI value were within acceptable normal limits.
Melanoma Surgery subscale
Five-point response: The MSS comprised eight items scored on a five-point scale. The threshold map showed that the category probability curves for seven of the eight items showed disordering. The summary statistics indicated substantial misfit.
Four-point response: In the first analysis, the summary statistics showed poor fit for the eight-item subscale. The threshold map showed disordering for five of the eight items and the goodness-of-fit χ2 showed evidence of misfit (P<0.0001). Item fit residuals indicated that removal of M17 and M16 would improve fit to the Rasch model. Summary fit statistics for the new six-item scale were within acceptable limits (Table 3).
Applications of the Rasch model are useful to inform the quality of the measurement properties of existing instruments. The present study provides an example of the utility of this analysis and the practical application to an aspect of cancer QoL measurement. The analysis of data according to the Rasch model provides a range of fit statistics to check whether or not a sum score is optimal. If the invariance of responses across different groups of individuals does not hold, then taking the total score to characterize a person may be inaccurate or misleading.
The analyses reported here highlight two questions about the psychometric properties of the FACT-M: (a) whether a five-point or a four-point response format should be used and (b) whether items should be deleted to improve its measurement properties.
In the measurement literature, the number of response categories that works ‘best’ for instruments of this type remains a subject of much debate, with a general rule that the minimum number of categories should be in the region of five to seven 21. Interestingly, this is contrary to the methodology adopted by the European Organization for Research and Treatment of Cancer (EORTC) Quality of Life Group, which recommends the use of a four-point response format for cancer disease-specific modules 22, and to the evidence-based guidelines published by Khadka et al. 23, which recommend a maximum of five categories for most ratings and the use of nonoverlapping categories.
It is not unusual for scales developed earlier to be modified later, usually involving a reduction in items. Sophisticated software now allows analyses that were not previously available. For example, lead author, J.W., has recently published an updated analysis of the Manchester Clinical Supervision Scale (MCSS), which reduced the instrument from 36 to 26 items, and from seven to six factors 24. For the MCSS, no rescoring was needed as the five-point option proved reliable for the set of items used. The middle-point option gave the respondent the chance to say ‘neither agree nor disagree’, which, depending on the phrasing of the item, is often useful and sometimes necessary for the item to make sense. The findings here strongly suggest that patients confuse the options ‘a little bit’ and ‘somewhat’, which are conceptually overlapping categories whose interpretation may vary. If a five-point response format is to be retained, with a term placed between ‘a little bit’ and ‘quite a bit’, the measurement properties might be improved by considering an alternative term, for example, ‘moderately’. Any loss of measurement quality at the level of the rating scale may degrade the quality of clinical studies that use patient-reported outcome measures 25.
The deletion of items from an existing scale should be avoided where possible, as some items may be important in the context of clinical management 26. In this respect, retaining the two items in the AC subscale, M6 ‘I have noticed blood in my stool’ and M2 ‘I have noticed new changes in my skin (lumps, bumps, colour)’, might be preferable. Three of the original FACT-G items showed evidence of misfit within the scale to which they were assigned. Further large-scale studies could determine whether inclusion or deletion of these questions, and/or the use of a four-point response format, would produce more reliable measurement.
The results of the present study point to possible improvements in the structure and response format of the FACT-M for use in future melanoma clinical trials. These observations are based on two relatively small samples of melanoma patients being treated in Sydney, Australia, and the study did not adopt a randomized design such as the one reported by Osteras et al. 26. However, the two patient groups were well matched in terms of demographics, except for patient sex. A much larger study with international participation will be required to examine whether the results can be generalized across other language versions and other cultures.
This study was funded by the University of Sydney Research Fund and Australia and New Zealand Clinical Trials Group.
Conflicts of interest
There are no conflicts of interest.
1. The World Health Organization Quality of Life Assessment: position paper from the World Health Organization. Soc Sci Med. 1995;41:1403
2. Andrew P. New drug to help body fight melanoma. Sydney Morning Herald, 28 March 2011
3. Robinson D, Cormier J, Zhao N, Uhlar C, Revicki D, Cella D. Health-related quality of life among patients with metastatic melanoma: results from an international phase 2 multicenter study. Melanoma
4. Cella DF, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi A, et al. The Functional Assessment of Cancer Therapy Scale: development and validation of the general measure. J Clin Oncol. 1993;11:570–579
5. Cormier J, Davidson L, Webster K, Cella D, Xing Y, Ross M, et al. Prospective assessment of the reliability, validity, and sensitivity to change of the functional assessment of cancer therapy-melanoma questionnaire. Cancer. 2008;112:2249–2257
6. Sigurdardottir V, Bolund C, Brandberg Y, Sullivan M. The impact of generalised malignant melanoma on quality of life evaluated by the EORTC questionnaire technique. Qual Life Res. 1993;2:193–203
7. Rasch G. Probabilistic models for some intelligence and attainment tests. Danish Institute for Educational Research, Copenhagen, 1960. Expanded edition, The University of Chicago Press, 1980
8. RUMM Laboratory Pty Ltd: RUMM2030 for Windows, Western Australia, 2011. Available at: http://www.rummlab.com.au [Accessed 27 November 2012]
9. Bond T, Fox C. Applying the Rasch model. Fundamental measurement in the human sciences. 2007, 2nd ed. New York: Routledge
10. Wright B, Stone M. Best test design. Rasch measurement. 1979. Chicago: Mesa Press
11. Tennant A, Conaghan P. The Rasch measurement model in rheumatology: what is it and why use it? When should it be applied and what should one look for in a Rasch paper? Arthritis Rheum. 2007;57:1358–1362
12. Merbitz C, Morris J, Grip JC. Ordinal scale and foundations of misinference. Arch Phys Med Rehabil. 1989;70:308–312
13. Smith A, Wright P, Selby P, Velikova G. A Rasch and factor analysis of the Functional Assessment of Cancer Therapy-General (FACT-G). Health Qual Life Outcomes. 2007;5:19
14. user manual, version 19. 2010. New York: IBM Corporation
15. Streiner D, Norman G. Health measurement scales: a practical guide to their development and use. 1995, 2nd ed. London: Oxford University Press
16. Cronbach L. Coefficient alpha and the internal structure of tests. Psychometrika. 1951;16:297–334
17. Linacre J. Foreword. Applying the Rasch model. Fundamental measurement in the human sciences, by Bond T and Fox C. 2007, 2nd ed. New York: Routledge
18. Pallant J, Tennant A. An introduction to the Rasch measurement model: an example using the Hospital Anxiety and Depression Scale (HADS). Br J Clin Psychol. 2007;46:1–18
19. Bland J, Altman D. Multiple significance tests: the Bonferroni method. Br Med J. 1995;310:170
20. Smith E. Detecting and evaluating the impact of multi-dimensionality using item fit statistics and principal components analysis of residuals. J Appl Meas. 2002;3:205–231
21. Cox EP. The optimal number of response alternatives for a scale: a review. J Market Res. 1980;17:407–422
22. Johnson C, Aaronson N, Blazeby J, Bottomley A, Fayers P, Koller M, et al. Guidelines for developing questionnaire modules. 2011, 4th ed. EORTC, Brussels. EORTC Quality of Life
23. Khadka J, Gothwal V, McAlinden C, Lamoureux E, Pesudovs K. The importance of rating scales in measuring patient-reported outcomes. Health Qual Life Outcomes. 2012;10:80
24. Winstanley J, White E. The MCSS-26©: revision of the Manchester Clinical Supervision Scale using the Rasch Measurement Model. J Nurs Meas. 2011;19:60–176
25. Nilsson A, Tennant A. Past and present issues in Rasch analysis: the functional independence measure (FIM™) revisited. J Rehabil Med. 2011;43:884–891
26. Osteras N, Gulbrandsen P, Garratt A, Benth J, Dahl F, Natvig B, Brage S. A randomised comparison of a four and a five point scale version of the Norwegian Function Assessment Scale. Health Qual Life Outcomes. 2008;6 [Epub ahead of print]
Keywords: clinical trials; FACT-Melanoma; measurement; melanoma; quality of life; Rasch analysis; response format
© 2013 Lippincott Williams & Wilkins, Inc.