Journal of Neuroscience Nursing: February 2014, Volume 46, Issue 1
doi: 10.1097/JNN.0000000000000027

Measurement Properties of the 12-Item Short-Form Health Survey in Stroke

Westergren, Albert; Hagell, Peter


Author Information

Questions or comments about this article may be directed to Albert Westergren, RN PhD. He is a Professor at the PRO-CARE Group, School of Health and Society, Kristianstad University, Kristianstad, Sweden.

Peter Hagell, RN PhD, is a Professor at the PRO-CARE Group, School of Health and Society, Kristianstad University, Kristianstad, Sweden.

Albert Westergren conceived and designed the study, carried out the analyses, and drafted the manuscript. Peter Hagell conceived and designed the study, revised the manuscript, and interpreted the data. Both authors read and approved the final manuscript.

This study was supported by the Swedish Research Council and the Skane County Council Research and Development Foundation.

The authors declare no conflicts of interest.



ABSTRACT: Background: The 12-item Short-Form Health Survey (SF-12) was developed to measure perceived physical and mental health. Some studies of the psychometric properties, using classical test theory, of the SF-12 provide support for its use in patients with stroke, but it has not been scrutinized using recommended modern test theory approaches such as the Rasch measurement model among stroke survivors. Objectives: This study sought to explore the measurement properties of the SF-12 physical and mental health scales among people with stroke using the Rasch measurement model. Design: A cross-sectional design was used in this study. Methods: All patients discharged from a dedicated stroke unit in southern Sweden during 6 months were asked to participate 6 months later. Of 120 stroke survivors, 89 (74%) agreed to participate. Rasch analysis was used to assess the measurement properties of the SF-12 physical and mental component summary scores (PCS-12 and MCS-12, respectively). Results: For the PCS-12, we identified problems with targeting, overall and item-level fit, representing local response dependency, and multidimensionality. For the MCS-12, there were problems related to targeting (the persons felt better than the scale could conceptualize) and response categories that did not function as expected. However, MCS-12 items displayed reasonable model fit without indications of multidimensionality but with signs of local response dependency. Conclusion: The measurement properties of the MCS-12 in stroke appear reasonable unless milder mental health problems are of interest, whereas those of the PCS-12 are less acceptable. Given the interdependence between MCS-12 and PCS-12 that is inherent with the standard SF-12 scoring algorithm, such data should be interpreted with caution.

Stroke is one of the most common and disabling chronic conditions in Western societies (Murray & Lopez, 1997). Promoting health status among persons with stroke is an issue not only in hospitals but also for community-based services and primary healthcare (Mayo, Wood-Dauphinee, Cote, Durcan, & Carlton, 2002). Systematic tracking of health outcomes after discharge from the hospital is therefore desirable to gain insights into long-term outcomes among stroke survivors. To be useful in determining the effects of health-promoting actions after stroke, such measures need to demonstrate good psychometric properties in terms of reliability and validity (Cano & Hobart, 2011).

Physical and mental health is a concern for nurses, and the concept of “perceived health” is frequently used in nursing research (Moons, Budts, & De Geest, 2006). The 12-item Short-Form Health Survey (SF-12) is a generic (non-disease-specific) rating scale used to measure perceived physical and mental health. The SF-12 is a short version of the SF-36 (Ware & Sherbourne, 1992) and can be used as a self-report questionnaire as well as in interviews (Ware, Kosinski, & Keller, 1995, 1996). The SF-12 was developed to reproduce the physical and mental summary scores of the SF-36 (Ware et al., 1996), and strong agreement between SF-12 and SF-36 generated summary scores among, for example, people with stroke has been found (Pickard, Johnson, Penn, Lau, & Noseworthy, 1999). The SF-12 has also been found useful among independently living older people (Resnick & Nahm, 2001) and more suitable for older people than the SF-36 because it has fewer items (Hayes, Morris, Wolfe, & Morgan, 1995).

There has been general support for the psychometric properties of the SF-12 according to classical test theory (CTT; Bowling, 1997; Gandek et al., 1998; Ware et al., 1996) also among people with stroke (Bohannon, Maljanian, Lee, & Ahlquist, 2004; Lim & Fisher, 1999; Pickard et al., 1999), although some concerns have been raised regarding its items-to-scale structure (Jakobsson, Westergren, Lindskov, & Hagell, 2012). However, CTT has a number of well-recognized limitations that can be overcome by modern psychometric methods such as the Rasch measurement model (Hagquist, Bruce, & Gustavsson, 2009; Hobart & Cano, 2009; Rasch, 1960; Wright & Masters, 1982). The Rasch model allows for rigorous testing of measurement instruments and makes it possible to examine aspects of measurement beyond the limitations of CTT (Cano & Hobart, 2011). It is therefore considered superior to CTT in determining rating scale measurement properties (Hobart & Cano, 2009). However, the Rasch model has hitherto rarely been used in nursing research (Hagquist et al., 2009), and the SF-12 has, to the best of our knowledge, not been Rasch analyzed among stroke survivors.



The aim of this study was to explore the measurement properties of the SF-12 physical and mental health scales among people with stroke using the Rasch measurement model.




All patients discharged from a dedicated stroke unit in southern Sweden during 6 months were asked to participate 6 months later (Jakobsson et al., 2012; Pajalic, Karlsson, & Westergren, 2006; Westergren, 2008; Westergren & Hagell, 2006). No predefined exclusion criteria were used. In total, 120 persons were eligible for inclusion, of whom 89 (74%) agreed to participate (age, M = 77.2 years, SD = 6.6 years; 49.4% women). Of these, 72 had been discharged to their own homes, and 17, to special accommodations. The vast majority (n = 82) had no clinically significant cognitive impairments, whereas seven exhibited mild memory problems, according to overall clinical assessments (Berger, 1980). There were no differences in age, gender, or perceived general health between those who did (n = 89) and those who did not (n = 31) agree to participate (Table 1). The main reasons for dropout were lack of energy (n = 11) and cognitive or communication difficulties (n = 4). The study was approved by the local Research Ethics Committee (LU 827-02).

Table 1
Procedures and Data Collection

Data collection was conducted 6 months after discharge as structured face-to-face interviews by a specialized stroke research nurse (trained and experienced with this type of data collection). Interviews were conducted by strict adherence to a structured protocol including the SF-12, which was interviewer administered according to recommendations in the SF-12 manual (Ware et al., 1995), in the patients’ own homes (n = 71), special accommodations (n = 15), and during clinic follow-ups (n = 3). All responses were provided by the patients themselves. Nonparticipating patients were asked to complete three questions (age, gender, and general health).

12-Item Short-Form Health Survey

The SF-12 consists of 12 items representing the eight health domains of the SF-36 (one or two items from each SF-36 domain) intended to reproduce the two physical and mental component summary scores (PCS-12 and MCS-12, respectively; Ware et al., 1995, 1996; Ware & Sherbourne, 1992). The PCS-12 and MCS-12 are assumed to be represented by six items each (Table 2).

Table 2
The Rasch Model

The Rasch model is used in psychometrics, a field of study concerned with the validation and usefulness of rating scales that are commonly used in, for example, the clinical neurosciences. Based on fundamental measurement principles from the physical sciences, the Rasch model mathematically defines what is required from item responses to yield linear measures (Rasch, 1960). According to the Rasch model, the probability of a certain item response is a function of the difference between the level of the measured construct (e.g., health) represented by the item and that possessed by the person. The model separately locates persons and items on a common interval level logit (log-odd units) metric, which ranges from minus infinity to plus infinity (with mean item location set at zero). The extent to which successful measurement has been achieved is determined by examining the fit between observed data and model expectations. If data are in accordance with the model, the scale can be used for invariant comparisons (Ewing, Salzberger, & Sinkovics, 2005; Hagquist et al., 2009).
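
As an illustration only, the core idea that a response probability depends on the difference between person and item location can be sketched in Python for the simplest (dichotomous) form of the Rasch model. The function name and example values are assumptions made for this sketch; the study itself used the polytomous (partial credit) form estimated in RUMM 2020.

```python
import math

def rasch_probability(person_location: float, item_location: float) -> float:
    """Probability of endorsing a dichotomous item under the Rasch model.

    Both locations are expressed in logits; the probability depends only
    on their difference.
    """
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))

# A person located exactly at the item location has a 50% chance of endorsing it.
p_equal = rasch_probability(0.0, 0.0)
# A person located 1 logit above the item is more likely to endorse it.
p_above = rasch_probability(1.0, 0.0)
```

Because only the difference matters, shifting both locations by the same amount leaves the probability unchanged, which is what makes it possible to fix the mean item location at zero.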

Rasch analysis addresses a range of fundamental requirements for successful rating-scale-based measurement. First, a good scale should be unidimensional; that is, the items in the scale should represent the same construct, with items otherwise being independent of each other (referred to as local response independence). Second, item responses should map out and cover a quantitative continuum that is well targeted to the range of levels represented by the persons who are measured. Third, and related to the previous point, the instrument should be able to separate groups of people (strata), for instance, according to differences in perceived health. Fourth, the instrument should work in the same way regardless of other person factors such as gender and age. That is, person locations on the quantitative continuum should not be dependent on whether the respondent is male or female, old or young, and so on (Fleishman & Lawrence, 2003; Hagquist et al., 2009).

SF-12 Analyses

The PCS-12 and MCS-12 were analyzed individually as two separate six-item physical and mental health scales, respectively, according to the unrestricted (partial credit) Rasch model, which does not assume equal response categories across items (Hagquist et al., 2009; Wright & Masters, 1982). Rasch analysis was performed using RUMM 2020 (Rumm Laboratory Pty. Ltd., Perth, WA); all other analyses were conducted using SPSS 14 (SPSS Inc., Chicago, Illinois). P values are two tailed and considered significant when <.05 following Bonferroni adjustment.
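
A minimal sketch of the Bonferroni adjustment mentioned above is given here in Python. The number of tests (six, one per item in a six-item scale) is an assumption for illustration; the text does not state which family of tests the adjustment was applied to, and the helper names are hypothetical.

```python
def bonferroni_alpha(nominal_alpha: float, number_of_tests: int) -> float:
    """Per-test significance level after Bonferroni adjustment."""
    if number_of_tests < 1:
        raise ValueError("number_of_tests must be at least 1")
    return nominal_alpha / number_of_tests

def is_significant(p_value: float, nominal_alpha: float, number_of_tests: int) -> bool:
    """True when a p value falls below the Bonferroni-adjusted level."""
    return p_value < bonferroni_alpha(nominal_alpha, number_of_tests)

# Assumption: six item-level tests, one for each item of a six-item scale.
adjusted = bonferroni_alpha(0.05, 6)
```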

On the basis of the structure proposed by Wright and Masters (1982) and Hobart and Cano (2009), we used Rasch analysis in attempting to answer the following main questions regarding the PCS-12 and MCS-12 among stroke survivors:

1. Is the scale-to-sample targeting adequate for making judgments about the performance of the PCS-12 and MCS-12 scales and the health measurement of people with stroke?

2. Have PCS-12 and MCS-12 measurement rulers been constructed successfully among people with stroke?

3. Have the people with stroke been measured successfully by the PCS-12 and MCS-12?

Details of the analyses are presented in combination with the presentation of results.


Integrated Description of Analysis and Results

Is the Scale-to-Sample Targeting Adequate for Making Judgments About the Performance of the PCS-12 and MCS-12 Scales and the Health Measurement of People With Stroke?

Good targeting means that the scale represents the levels of health reported by the sample; poor targeting undermines good measurement of people reporting health states that are not covered by the scales’ range of measurement. Similarly, to allow for reasonable assessments of the performance of the scale, there need to be observations in the sample that represent the full scale range. Good targeting is therefore essential for good measurement, whereas mistargeting results in lower precision and problems with differentiating between people along the latent scale (Hagquist et al., 2009). One indicator of the targeting of a scale is the mean person location, which expresses the average magnitude and direction by which the person locations differ from the item locations (which are set at a mean of 0 logits). In terms of the SF-12, positive person logit locations indicate that the sample experiences better health than that represented by the scale, and vice versa for negative person logit locations (Hagquist et al., 2009). In general, mean person locations up to ±0.5 logits indicate good targeting (Hudgens, Dineen, Webster, Lai, & Cella, 2004; Lai & Eton, 2002).

Results: For both the MCS-12 and PCS-12, item threshold locations were fairly well covered by the persons (Figure 1). This suggests that targeting was adequate for making judgments about the performance of the PCS-12 and MCS-12 scales. For the PCS-12, the mean person location was 0.223 (SD = 1.715) logits (Figure 1A), suggesting that targeting was acceptable for measurement of the people. However, for the MCS-12, the mean person location was 1.131 (SD = 1.385) logits (Figure 1B). That is, the MCS-12 tends to represent more severe mental health problems than those experienced by the sample, as a substantial proportion of people are located outside (above) its measurement range (Figure 1B). Thus, the scale provides limited information about people with better mental health. Clinically, it is primarily the ability to make adequate measurement of people that is of interest. From this perspective, our observations suggest that the PCS-12 would be able to meet this need among people with stroke, whereas this is not the case for the MCS-12 among those with less severe mental health problems.
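
The ±0.5 logit rule of thumb can be applied to the two reported mean person locations in a few lines of Python. This is only a sketch of the comparison; the function and variable names are hypothetical, and the rule is a guideline rather than a formal test.

```python
def targeting_is_good(mean_person_location: float, tolerance: float = 0.5) -> bool:
    """True when the mean person location lies within ±tolerance logits of
    the mean item location, which is fixed at 0 logits."""
    return abs(mean_person_location) <= tolerance

# Mean person locations (logits) as reported in the Results.
reported_means = {"PCS-12": 0.223, "MCS-12": 1.131}
targeting = {scale: targeting_is_good(mean) for scale, mean in reported_means.items()}
```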

Figure 1

Note. Thresholds are the locations where there is a 50/50 probability of responding in either of two adjacent item response categories. Item locations are represented by the mean response category threshold locations for each item. (a) Physical component summary score. (b) Mental component summary score.

Have PCS-12 and MCS-12 Measurement Rulers Been Constructed Successfully Among People With Stroke?
Do the Item Response Categories Work as Intended?

Ordered response category thresholds appear when the categories function as expected in a sample, that is, when they reflect an increasing amount of the measured variable. Disordered thresholds mean that the locations where there is a 50/50 probability of responding in either of two adjacent categories do not appear in an expected sequential order from less to more. There can be many reasons for disordered thresholds. For example, response category labels may be ambiguously worded or too many for respondents to make reliable distinctions between them. Other reasons include multidimensionality and poor targeting (Hagquist et al., 2009).
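
To make the consequence of disordered thresholds concrete, the following Python sketch computes response category probabilities under the partial credit model and checks which categories are ever the most probable. The threshold values are hypothetical and chosen only for illustration; they are not the estimates from this study.

```python
import math

def category_probabilities(person_location, thresholds):
    """Partial credit model: probability of each response category 0..m."""
    cumulative = [0.0]  # category 0 has an empty sum in the exponent
    for tau in thresholds:
        cumulative.append(cumulative[-1] + (person_location - tau))
    weights = [math.exp(value) for value in cumulative]
    total = sum(weights)
    return [weight / total for weight in weights]

def thresholds_ordered(thresholds):
    """True when thresholds increase strictly from first to last."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))

def ever_modal_categories(thresholds, grid):
    """Categories that are the most probable response somewhere on the grid."""
    modal = set()
    for theta in grid:
        probabilities = category_probabilities(theta, thresholds)
        modal.add(probabilities.index(max(probabilities)))
    return modal

grid = [step / 10 for step in range(-40, 41)]  # -4.0 to 4.0 logits
ordered = [-1.0, 0.0, 1.0]      # hypothetical ordered thresholds
disordered = [-1.0, 1.0, 0.0]   # hypothetical: second and third reversed
modal_ordered = ever_modal_categories(ordered, grid)
modal_disordered = ever_modal_categories(disordered, grid)
```

With the ordered thresholds every category is the most probable response over some interval, whereas with the reversed pair category 2 never is, which mirrors the pattern described for the affected response categories.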

Results: There were no disordered thresholds in any of the items in the PCS-12. However, the MCS-12 exhibited disordered thresholds in items 9, 11, and 12 (Figure 2A–C). As a consequence, response categories 2 and 3 (item 9), 1 and 2 (item 11), and 1 (item 12) never appear as the most probable responses (Figure 2). This is problematic because it challenges both the user friendliness of the scale and the interpretability of responses. Although the exact cause for the observed threshold disordering cannot be determined from the data, it suggests difficulties among respondents in using and distinguishing between these response categories with an associated increased respondent burden; simultaneously, interpretability of responses is compromised.

Figure 2

Note. Disordered thresholds were found for (a) item 9 (“feeling calm and peaceful”) involving response categories 2 and 3, (b) item 11 (“feeling downhearted and blue”) involving response categories 1 and 2, and (c) item 12 (“health interference with social activities”) involving response category 1.

Do the Items Map Out Discernible Lines of Increasing Intensity?

To successfully quantify something, the measuring instrument should have indicators representing various levels on the quantitative continuum from less to more. That is, there need to be “notches” on the ruler, and these should preferably be located at regular intervals along the continuum without excessive clustering or gaps between them. In rating scales, the notches are the item response category thresholds, and these can be examined to see if items successfully map out a discernible line of inquiry (Hobart & Cano, 2009).

Results: It can be seen from Figure 1 (bottom histograms of panels A and B) that both PCS-12 and MCS-12 items represent a quantitative continuum from less to more, without excess clustering of thresholds at the same locations. However, both scales, particularly the PCS-12 (Figure 1A), demonstrate some gaps between thresholds. This means that measurement precision is compromised along these areas. The significance of this depends partly on the purpose of using the scale and the nature of the target sample. That is, if only a crude assessment of health levels is desired, it may be of minor importance, whereas it can have detrimental consequences if small but clinically important differences or changes need to be detected within these levels of perceived health.

Are the Locations of Items Along This Line Reasonable?

The hierarchical ordering of items along the quantitative continuum from less to more provides a means of assessing the internal validity of items in a scale. This refers to whether the relative locations of items are clinically reasonable given the nature of the sample and the variable (Hobart & Cano, 2009). Because stroke is associated with acute onset and progressive improvements in physical and mental health are expected thereafter, it is reasonable to consider PCS-12 and MCS-12 item locations as health hierarchies from negative (poorer health) to positive (better health) locations.

Results: Table 3 lists PCS-12 and MCS-12 items according to their average locations (each item location is the mean of its response category threshold locations), where negative logit locations represent poorer health and positive locations represent better health. Observed PCS-12 item ordering shows that the extremes in the scale are constituted by items representing pain interference (item 8; −2.039 logits) and limited in kind of work (item 5; 0.771 logits), respectively (Table 3). This, as well as the overall item hierarchy, appears generally reasonable from a clinical perspective because the ordering suggests that perceived physical health improvement after stroke progresses from less interfering pain, improved general health, improved ability to climb stairs, less interference with moderate activities, being able to accomplish more, and finally, becoming less limited in work/activities.

Table 3

As for the MCS-12, the extremes are constituted by items representing feeling downhearted and blue (item 11; −0.929 logits) and energy (item 10; 1.363 logits), respectively (Table 3). The implied pattern of mental health recovery would thus go from feeling less downhearted and blue, feeling more calm and peaceful, experiencing less mental health interference in social activities, less emotionally based interference in work/activities, accomplishing more, and finally, regained energy. Similarly to the PCS-12, these findings appear generally reasonable from a clinical perspective and support the internal validity of the scale.

Do Items Work Together to Operationalize Single Variables?

A basic assumption when summing rating scale items into a total score is that the items represent a common underlying variable, that is, that they are unidimensional (Nunally & Bernstein, 1994). The extent to which items in a scale work together as representatives of a single variable is reflected through analyses of fit between item response data and Rasch model requirements. Fit refers to the extent to which observed item responses are predicted by, or recovered from, the Rasch model (Hobart & Cano, 2009). Rasch analysis provides overall as well as individual-item-level fit statistics. However, no individual aspect of fit is either sufficient or necessary on its own, but they need to be interpreted interactively and relative to the variable, sample, and purpose of scale use (Andrich, Luo, & Sheridan, 2004; Hagquist et al., 2009; Hobart & Cano, 2009). In a scale with good overall model fit, the mean standardized item fit residuals (the discrepancies between observed and expected responses summarized over all items and respondents) should be close to zero with a standard deviation close to 1 and a nonsignificant total item–trait interaction chi-square statistic. If these criteria are not fulfilled, it indicates that there might be items in the scale that do not fit the model at all levels of ability and over all items (Hagquist et al., 2009).

Individual item fit can also be assessed graphically by item characteristic curves (ICC), where observed item responses are compared with model expectations. Quantitatively, standardized item fit residuals represent the interaction between the specific item and the people who responded to it (Hobart & Cano, 2009). In general, individual item fit residuals should range between −2.5 and 2.5, with the ideal being 0 (Andrich et al., 2004). Negative values suggest overdiscriminating items, whereas positive values suggest underdiscriminating items (Hobart & Cano, 2009). Overdiscriminating items contribute a great deal of information but over a narrow range and suggest local response dependence (item redundancy). Local response dependence is also manifested by correlated item residuals. Underdiscriminating items provide less information, and underdiscrimination is typically due to multidimensionality (i.e., the item represents a different variable than the scale as a whole). However, the nature of observed misfit (local response dependence or multidimensionality) often needs to be determined based on conceptual reasoning rather than statistics (Hagquist et al., 2009). One statistical approach is to combine items into higher order items (so-called subtests, or testlets) and compare the resulting reliability with that from the original single items; decreased reliability after the combination of items suggests local response dependence (Hagquist et al., 2009; Marais & Andrich, 2008b). Alternatively, one of two locally dependent items may be omitted.
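
The ±2.5 guideline for item fit residuals can be expressed as a small Python helper. This is a sketch under assumptions: the residual values below are hypothetical and serve only to show the three possible classifications; they are not taken from Table 3.

```python
def classify_item_fit(fit_residual: float, limit: float = 2.5) -> str:
    """Classify a standardized item fit residual against the ±limit guideline."""
    if fit_residual < -limit:
        return "overdiscriminating"   # large negative residual
    if fit_residual > limit:
        return "underdiscriminating"  # large positive residual
    return "within range"

# Hypothetical residuals, for illustration only.
hypothetical_residuals = {"item A": -2.9, "item B": 0.3, "item C": 2.7}
classification = {
    item: classify_item_fit(residual)
    for item, residual in hypothetical_residuals.items()
}
```

As the text notes, such a flag is only a starting point: whether misfit reflects local dependence or multidimensionality still has to be judged conceptually.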

The chi-square value is another indicator of the interaction between the item and the measured variable (here physical and mental health). The associated p value is the probability that the discrepancy between the observed and the expected values is a chance finding, and if significant, it suggests misfit (Hobart & Cano, 2009). Chi-square values can also be interpreted as an order statistic, where a sudden leap (increase) is suggestive of misfit (Andrich et al., 2004).

Results: Overall, MCS-12 items were found to fit the model better, with a mean item fit residual closer to zero (M = −0.204, SD = 0.433) than was found for the PCS-12 (M = −0.813, SD = 1.716). Accordingly, the total item–trait interaction was significant for PCS-12 but not for MCS-12 (Table 4). There were no overt signs of item-level misfit according to the fit residuals, which were within the range of −2.5 and 2.5 for both the PCS-12 (range, −2.159 to 2.107) and the MCS-12 (range, −0.767 to 0.250; Table 3).

Table 4

Chi-square values ranged from 3.232 (climb several flights of stairs) to 13.806 (general health) in the PCS-12 and from 0.564 (downhearted and blue) to 4.476 (accomplish less emotional) in the MCS-12 (Table 3). Inspection of how these values changed sequentially across items (Table 3) revealed that for the PCS-12, there is a major leap to item 1, associated with a significant p value (following Bonferroni correction). These observations suggest that item 1 may be introducing multidimensionality to the PCS-12. For the MCS-12, there is a more gradual increase in the chi-square values, with no significant p values (Table 3).

Inspection of the ICCs for each item indicated problems with PCS-12 items 4 and 5 as well as with MCS-12 items 6 and 7, suggesting overdiscrimination and possible local response dependence (Figure 3). Accordingly, we examined the residual correlations, which were strong between PCS-12 items 4 and 5 (r = .920) and MCS-12 items 6 and 7 (r = .953). We have previously observed similar problems with these SF-12 items among people with Parkinson disease (Hagell & Westergren, 2011). In that study, explorative deletion of items 5 (PCS-12) and 6 (MCS-12) improved the respective scales. Similarly to that study, combining items 4 + 5 and 6 + 7 into subtests resulted in decreased reliabilities (Table 4), supporting the presence of local response dependency. Deletion of items 5 and 6 resulted in slight improvements of both the PCS-12, item mean (SD) fit residual = −0.469 (1.491), χ²(10) = 32.014, p = .0004, and MCS-12, item mean (SD) fit residual = 0.053 (0.609), χ²(10) = 7.758, p = .652.
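
For readers who wish to check the p values reported for the two reduced scales, the upper-tail probability of a chi-square statistic with an even number of degrees of freedom has a closed form that needs only the Python standard library. This is an independent check written for illustration, not the computation performed by RUMM 2020; the function name is hypothetical.

```python
import math

def chi_square_upper_tail(statistic: float, df: int) -> float:
    """Upper-tail probability of a chi-square statistic for even, positive df.

    Uses the closed form exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires an even, positive df")
    half = statistic / 2.0
    partial_sum = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * partial_sum

# Statistics reported after deletion of items 5 and 6, each with 10 df.
p_pcs = chi_square_upper_tail(32.014, 10)
p_mcs = chi_square_upper_tail(7.758, 10)
```

Rounded to the reported precision, these evaluate to approximately .0004 and .652, in agreement with the values given above.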

Figure 3

Collectively, these observations regarding model fit hint at several directions. First, the observed multidimensionality in the PCS-12 (related to item 1) means that interpretation of these scores is challenged because it is uncertain what they represent. Second, both the PCS-12 and MCS-12 suffered from local dependency due to item redundancy, which thus increases respondent burden without any gains in terms of measurement.

Note. Black dots represent the observed responses in the sample divided into three class intervals according to their locations on the measured construct, indicated by the marks on the x axis. (a) Item 5 (“limited in kind of work”) of the physical component of the 12-item Short-Form Health Survey. (b) Item 6 (“accomplished less due to emotional health”) of the mental component of the 12-item Short-Form Health Survey. Both panels represent overdiscrimination (empirical observations are steeper than expected).

Do Items Function the Same Way Across Subgroups of People?

Analysis of differential item functioning (DIF) allows for testing of whether items have the same meaning for different subgroups of respondents, for instance, men and women (Hagquist et al., 2009). When items are invariant, there is no DIF. There are two major types of DIF. Uniform DIF means that there is a systematic difference between subgroups in how people respond to an item, independent of their locations on the measured variable. When there is nonuniform DIF, there is an interaction effect between subgroup affiliation and location on the measured variable. That is, different levels on the measured variable (class intervals) and group affiliation (e.g., gender) interact. Both types are detrimental for invariant and valid measurement (Hagquist et al., 2009).

Results: There were no DIFs for age or gender in either the PCS-12 or the MCS-12. This suggests that items work the same way among both genders and age groups, and that responses are not biased by these factors. Therefore, scores may be compared between these groups in a valid manner.

Have the People With Stroke Been Measured Successfully by the PCS-12 and MCS-12?
Are the Persons Separated Along the Lines of Inquiry Represented by the Items?

An important aim with measurement is to detect differences between people. In Rasch analysis, the separation of people is quantified in terms of the Person Separation Index (PSI). The PSI is conceptually analogous to Cronbach’s alpha and an estimate of reliability (Andrich, 1982). A PSI value above 0.7 is generally considered acceptable and suggests that at least two statistically distinct strata of people can be identified (Andrich, 1982; Hagquist et al., 2009; Hobart & Cano, 2009; Wright & Masters, 1982).

Results: The PSI was above 0.7 for both the PCS-12 (0.79) and the MCS-12 (0.84). This was true also when treating items 4 + 5 of PCS-12 (PSI = 0.73) and items 6 + 7 of MCS-12 (PSI = 0.80) as one subtest item each (see above). These coefficients refer to the amount of estimated measurement error and indicate that both scales met commonly accepted standards. Both the PCS-12 and MCS-12 therefore appear reasonably able to differentiate groups of people; reliabilities of 0.7 and 0.8 imply that about two and three distinct levels of health can be identified, respectively (Wright & Masters, 1982).
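
The link between reliability and the number of distinct strata can be sketched in Python. The sketch assumes the commonly used formulation attributed to Wright and Masters (1982), in which the separation ratio is G = sqrt(R / (1 − R)) and the number of strata is (4G + 1) / 3; the text does not spell out the formula, so this should be read as an assumption, and the function name is hypothetical.

```python
import math

def strata_from_reliability(reliability: float) -> float:
    """Number of statistically distinct strata implied by a reliability
    coefficient, assuming strata = (4G + 1) / 3 with G = sqrt(R / (1 - R))."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must be in the interval [0, 1)")
    separation = math.sqrt(reliability / (1.0 - reliability))
    return (4.0 * separation + 1.0) / 3.0

# Benchmark reliabilities mentioned in the text.
strata_at_070 = strata_from_reliability(0.7)
strata_at_080 = strata_from_reliability(0.8)
# PSI values reported for the two scales.
strata_pcs = strata_from_reliability(0.79)
strata_mcs = strata_from_reliability(0.84)
```

Under this assumption, reliabilities of 0.7 and 0.8 correspond to roughly 2.4 and 3.0 strata, consistent with the "about two and three" levels stated above.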

How Valid Are the Person Measurements?

In Rasch analysis, it is possible to verify whether the person has responded to items in an expected way, that is, consistent with the idea that items map out a variable along which they have a unique order (Hobart & Cano, 2009). If an individual’s responses are not in general agreement with the ordering of items, the validity of that person’s measure can be questioned (Hobart & Cano, 2009). Similar to the fit of items, this can be examined by individual person fit statistics, where standardized residuals should average zero for the sample and range between −2.5 and 2.5 for each individual.

Results: Overall person fit residuals were close to the expected value of zero both for MCS-12 and PCS-12 (Table 4). Individual person fit residuals were also acceptable (range MCS-12, −1.645 to 1.357; range PCS-12, −1.723 to 2.559), with only one individual exceeding ±2.5. This suggests that the responses provided by the people in the current sample are valid and therefore can be interpreted with confidence.



This is, as far as we know, the first study assessing the measurement properties of the SF-12 in stroke by using the Rasch measurement model. For the PCS-12, we identified problems with targeting as well as with overall and item level fit, representing local response dependency and multidimensionality. For the MCS-12, there were problems related to targeting and disordered thresholds, but items displayed reasonable model fit, except for local response dependency. This is in accordance with earlier factor analytic observations that failed to obtain support for the proposed item-to-scale structure of the SF-12 in stroke (Jakobsson et al., 2012).

The PCS-12 did not appear to represent a unidimensional construct. Unidimensionality is a basic requirement for the use of total scores, and violation thereof challenges the meaning, interpretability, and validity of scores (Hagell & Nilsson, 2009; Nunally & Bernstein, 1994). The possible multidimensionality found with PCS-12 item 1 (“general health”) is in accordance with previous observations in stroke regarding SF-36-derived PCS scores (Hobart, Williams, Moran, & Thompson, 2002). Conceptually, multidimensionality means that item responses are not governed by variations in a common variable. Indeed, it may be questioned whether people’s perceived general health is congruent with a “physical health” variable.

We also observed local response dependencies for PCS-12 items 4 and 5 and MCS-12 items 6 and 7. This is not surprising when considering the respective item contents (Table 2). Indeed, inspection of the response patterns to these item pairs revealed that, with just one exception each, every person in our sample responded identically to items 4 and 5 and to items 6 and 7, respectively. This clearly suggests item redundancy. Although local response dependency may be thought of as less problematic than multidimensionality, it can influence person measures and mask differences when measuring change (Marais, 2009; Marais & Andrich, 2008a, 2008b). Interestingly, the pattern observed here among people with stroke is very similar to that previously reported among people with Parkinson disease (Hagell & Westergren, 2011). This replicated finding in an independent group of people with another neurological disorder strengthens the significance of this result and suggests that the SF-12 might benefit from removing items 5 and 6, at least when used among people with neurological disorders.

We found evidence of problems with targeting, particularly for the MCS-12, which appears to represent poorer mental health than that experienced by the respondents. This implies limited possibilities to detect changes (improvements) over time or after interventions. This may explain previous findings by Bohannon et al. (2004), who reported poor responsiveness to change up to 12 months after stroke for the MCS-12 relative to the PCS-12. In addition, Muller-Nordhorn et al. (2005) found better responsiveness of the MCS-12 after stroke among people with more severe symptoms compared with those with milder symptoms. Improving the measurement of mental health outcomes after stroke would thus require either the use of another scale with better targeting than the MCS-12 or further development of the MCS-12 by adding item(s) that represent less severe mental health impact.
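One rough way to summarize targeting is to compare person locations with the range covered by the item thresholds on the common logit scale. The sketch below assumes that person and threshold locations have already been estimated by Rasch software; the function name and all numbers are hypothetical, and which end of the scale represents better health depends on how the items are scored.

```python
from statistics import mean
from typing import Dict, Sequence


def targeting_summary(person_logits: Sequence[float],
                      threshold_logits: Sequence[float]) -> Dict[str, float]:
    """Compare person locations with the span of item thresholds (both in logits)."""
    lowest, highest = min(threshold_logits), max(threshold_logits)
    # Persons located outside the threshold range are measured with little precision.
    outside = [p for p in person_logits if p < lowest or p > highest]
    return {
        "mean_person": mean(person_logits),
        "mean_threshold": mean(threshold_logits),
        "share_outside": len(outside) / len(person_logits),
    }


# Hypothetical locations (assumption, not study data).
persons = [0.5, 1.2, 2.8, 3.1, 1.9, 2.4]
thresholds = [-2.0, -1.0, 0.0, 1.0, 2.0]

summary = targeting_summary(persons, thresholds)
print(summary)
```

In this made-up example, half of the persons lie beyond the highest threshold, which is the kind of mismatch that limits the ability to detect improvement.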

According to the standard SF-12 scoring algorithm (Ware et al., 1995, 1996), PCS-12 and MCS-12 scores are uncorrelated and individual item scores contribute differently to both scales. That is, scores on one of the scales influence scores on the other scale, so that poor PCS-12 scores have a positive influence on the same person’s MCS-12 scores (and vice versa; Farivar, Cunningham, & Hays, 2007). This could, at least in part, explain the counterintuitive finding that MCS-12 scores among people with stroke do not differ compared with general population values (Bohannon et al., 2004). In addition, the assumed interrelationship between MCS-12 and PCS-12 scores when using the standard SF-12 scoring algorithm implies that the scales cannot be used separately. It also suggests that compromised measurement properties in one scale may spill over and influence the quality of scores from the other scale when using the standard scoring algorithm. In this study, however, we analyzed the two scales separately from each other, and because Rasch analysis is directly related to the raw total scores (Andrich et al., 2004; Cano & Hobart, 2011; Hobart & Cano, 2009; Rasch, 1960), our observations imply an alternative, simpler, and more flexible means of using and scoring the SF-12.
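The interdependence described above can be illustrated with a toy example. The weights below are invented for illustration and are not the published SF-12 coefficients; the only assumption carried over is that an item from one domain may enter the other domain's composite with a weight of opposite sign. Item names and response values are likewise hypothetical, with higher values representing better health.

```python
from typing import Dict


def weighted_score(responses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted composite across all items, in the style of a cross-weighted algorithm."""
    return sum(weights[item] * responses[item] for item in weights)


# Hypothetical weights (assumption): the physical item enters the mental
# composite with a negative weight.
mental_weights = {"physical_item": -0.2, "mental_item": 1.0}

# Two hypothetical people with identical mental responses but different physical ones.
person_a = {"physical_item": 4, "mental_item": 3}
person_b = {"physical_item": 1, "mental_item": 3}

mcs_a = weighted_score(person_a, mental_weights)
mcs_b = weighted_score(person_b, mental_weights)

# Separately derived raw sum scores use the mental item(s) only.
raw_a = person_a["mental_item"]
raw_b = person_b["mental_item"]

print(mcs_a, mcs_b, raw_a, raw_b)
```

Under these assumed weights, the person with poorer physical responses receives the higher mental composite, whereas the separately derived raw scores are identical.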

The main limitation of this study is its sample size. However, sample sizes as small as n = 30 can provide useful and representative results, and sample size requirements in Rasch analysis relate to targeting and increase with poorer targeting (Linacre, 1994). In this study, items were generally well covered by the sample, which increases the confidence in results (Hobart & Cano, 2009). In view of this, our results probably represent a fairly liberal view of the SF-12 in stroke because the limited sample size primarily appears to restrict the ability of analyses to detect model violations. In particular, results from DIF analyses should be viewed with caution. Although our observations provide valuable information regarding the measurement properties of the SF-12 in stroke, additional studies in larger samples are warranted for firmer conclusions.

This study illustrates how the Rasch model can be used for rigorous examination of measurement scales in nursing research and shows one way to present the findings. In doing so, we found that the measurement properties of the MCS-12 appear acceptable in stroke but limited by targeting problems and item redundancy, which has negative implications for its use as an outcome measure. However, it may be useful as a survey tool to assess more severe levels of mental health impact after stroke. The measurement properties of the PCS-12 are associated with more severe problems related to, for example, the interpretability of scores. The interdependence between MCS-12 and PCS-12 scores in the standard SF-12 scoring algorithm is likely to distort measurement. Separately derived raw sum scores or, preferably, Rasch-derived measures are recommended instead. Stroke clinicians and researchers should use and interpret the SF-12 with caution.

The authors thank all participating patients for their cooperation and Siv Karlsson for assistance with data collection. The study was conducted within the Patient-Reported Outcomes–Clinical Assessment Research and Education (PRO-CARE) Group, Kristianstad University, Sweden. The authors especially thank the Swedish house in Kavalla, Greece.


Andrich D. (1982). An Index of Person Separation in Latent Trait Theory, the Traditional KR-20 Index, and the Guttman Scale Response Pattern. Educational and Psychological Research, 9 (1), 95–104.

Andrich D., Luo G., Sheridan B. (2004). Interpreting RUMM2020. Perth, WA: RUMM Laboratory.

Berger E. Y. (1980). A system for rating the severity of senility. Journal of the American Geriatrics Society, 28 (5), 234–236.

Bohannon R. W., Maljanian R., Lee N., Ahlquist M. (2004). Measurement properties of the Short Form (SF)-12 applied to patients with stroke. International Journal of Rehabilitation Research, 27 (2), 151–154.

Bowling A. (1997). Measuring health: A review of quality of life measurement scales (2nd ed.). Buckingham, UK: Open University Press.

Cano S. J., Hobart J. C. (2011). The problem with health measurement. Patient Preference and Adherence, 5, 279–290. doi:10.2147/PPA.S14399

Ewing M., Salzberger T., Sinkovics R. (2005). An alternative approach to assessing cross-cultural measurement equivalence in advertising research. Journal of Advertising, 34 (1), 17–36.

Farivar S. S., Cunningham W. E., Hays R. D. (2007). Correlated physical and mental health summary scores for the SF-36 and SF-12 Health Survey, V.1. Health and Quality of Life Outcomes, 5, 54. doi:10.1186/1477-7525-5-54

Fleishman J. A., Lawrence W. F. (2003). Demographic variation in SF-12 scores: True differences or differential item functioning? Medical Care, 41 (7 Suppl.), III75–III86. doi:10.1097/01.MLR.0000076052.42628.CF

Gandek B., Ware J. E., Aaronson N. K., Apolone G., Bjorner J. B., Brazier J. E., Sullivan M. (1998). Cross-validation of item selection and scoring for the SF-12 Health Survey in nine countries: Results from the IQOLA Project. International Quality of Life Assessment. Journal of Clinical Epidemiology, 51 (11), 1171–1178.

Hagell P., Nilsson M. H. (2009). The 39-item Parkinson’s Disease Questionnaire (PDQ-39): Is it a unidimensional construct? Therapeutic Advances in Neurological Disorders, 2 (4), 205–214. doi:10.1177/1756285609103726

Hagell P., Westergren A. (2011). Measurement properties of the SF-12 Health Survey in Parkinson’s disease. Journal of Parkinson’s Disease, 2 (1), 185–196. doi:10.3233/JPD-2011-11026

Hagquist C., Bruce M., Gustavsson J. P. (2009). Using the Rasch model in nursing research: An introduction and illustrative example. International Journal of Nursing Studies, 46 (3), 380–393. doi:10.1016/j.ijnurstu.2008.10.007

Hayes V., Morris J., Wolfe C., Morgan M. (1995). The SF-36 Health Survey Questionnaire: Is it suitable for use with older adults? Age and Ageing, 24 (2), 120–125.

Hobart J., Cano S. (2009). Improving the evaluation of therapeutic interventions in multiple sclerosis: The role of new psychometric methods. Health Technology Assessment, 13 (12), iii, ix–x, 1–177. doi:10.3310/hta13120

Hobart J., Williams L., Moran K., Thompson A. (2002). Quality of life measurement after stroke: Uses and abuses of the SF-36. Stroke, 33 (5), 1348–1356.

Hudgens S., Dineen K., Webster K., Lai J., Cella D. (2004). Assessing statistically and clinically meaningful construct deficiency/saturation: Recommended criteria for content coverage and item writing. Rasch Measurement Transactions, 17, 954–955.

Jakobsson U., Westergren A., Lindskov S., Hagell P. (2012). Construct validity of the SF-12 in three different samples. Journal of Evaluation in Clinical Practice, 18 (3), 560–566. doi:10.1111/j.1365-2753.2010.01623.x

Lai J., Eton D. (2002). Clinically meaningful gaps. Rasch Measurement Transactions, 15, 850.

Lim L. L., Fisher J. D. (1999). Use of the 12-item Short-Form (SF-12) Health Survey in an Australian heart and stroke population. Quality of Life Research, 8 (1–2), 1–8.

Linacre J. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7 (4), 328.

Marais I. (2009). Response dependence and the measurement of change. Journal of Applied Measurement, 10 (1), 17–29.

Marais I., Andrich D. (2008a). Effects of varying magnitude and patterns of response dependence in the unidimensional Rasch model. Journal of Applied Measurement, 9 (2), 105–124.

Marais I., Andrich D. (2008b). Formalizing dimension and response violations of local independence in the unidimensional Rasch model. Journal of Applied Measurement, 9 (3), 200–215.

Mayo N. E., Wood-Dauphinee S., Cote R., Durcan L., Carlton J. (2002). Activity, participation, and quality of life 6 months poststroke. Archives of Physical Medicine and Rehabilitation, 83 (8), 1035–1042.

Moons P., Budts W., De Geest S. (2006). Critique on the conceptualisation of quality of life: A review and evaluation of different conceptual approaches. International Journal of Nursing Studies, 43 (7), 891–901. doi:10.1016/j.ijnurstu.2006.03.015

Muller-Nordhorn J., Nolte C. H., Rossnagel K., Jungehulsing G. J., Reich A., Roll S., Willich S. N. (2005). The use of the 12-item Short-Form Health Status Instrument in a longitudinal study of patients with stroke and transient ischaemic attack. Neuroepidemiology, 24 (4), 196–202. doi:10.1159/000084712

Murray C. J., Lopez A. D. (1997). Alternative projections of mortality and disability by cause 1990–2020: Global Burden of Disease Study. Lancet, 349 (9064), 1498–1504. doi:10.1016/S0140-6736(96)07492-2

Nunnally J., Bernstein I. (1994). Psychometric theory (3rd ed.). New York, NY: McGraw-Hill.

Pajalic Z., Karlsson S., Westergren A. (2006). Functioning and subjective health among stroke survivors after discharge from hospital. Journal of Advanced Nursing, 54 (4), 457–466. doi:10.1111/j.1365-2648.2006.03844.x

Pickard A. S., Johnson J. A., Penn A., Lau F., Noseworthy T. (1999). Replicability of SF-36 summary scores by the SF-12 in stroke patients. Stroke, 30 (6), 1213–1217.

Rasch G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Danish Institute for Educational Research.

Resnick B., Nahm E. S. (2001). Reliability and validity testing of the revised 12-item Short-Form Health Survey in older adults. Journal of Nursing Measurement, 9 (2), 151–161.

Ware J., Kosinski M., Keller S. (1995). SF-12: How to score the SF-12 physical and mental health summary scales (2nd ed.). Boston, MA: The Health Institute, New England Medical Center.

Ware J., Kosinski M., Keller S. (1996). A 12-Item Short-Form Health Survey: Construction of scales and preliminary tests of reliability and validity. Medical Care, 34 (3), 220–233.

Ware J., Sherbourne C. (1992). The MOS 36-item Short-Form Health Survey (SF-36). I. Conceptual framework and item selection. Medical Care, 30 (6), 473–483.

Westergren A. (2008). Nutrition and its relation to mealtime preparation, eating, fatigue and mood among stroke survivors after discharge from hospital—a pilot study. Open Nursing Journal, 2, 15–20. doi:10.2174/1874434600802010015

Westergren A., Hagell P. (2006). Initial validation of the Swedish version of the London Handicap Scale. Quality of Life Research, 15 (7), 1251–1256. doi:10.1007/s11136-006-0054-4

Wright B., Masters G. (1982). Rating scale analysis. Chicago, IL: Mesa Press.


Keywords: classical test theory; health measurement; patient-reported outcomes; Rasch analysis; stroke; SF-12

© 2014 American Association of Neuroscience Nurses

