Carle, Adam C. PhD; Weech-Maldonado, Robert PhD
Validly interpreting patients’ reports of their experiences depends on valid scoring systems. A scale uses a set of questions to measure one or more latent (indirectly observed) constructs. With respect to each construct, a single summary score of responses often serves as an estimate of the construct. However, although a scale’s developers may have intended to measure a single construct, the questions may actually seem to measure more than one. If the questions seem to measure multiple constructs, one should not create a single summed score. Rather, one should create individual scores for each construct.1 Too frequently, individuals have created scales without empirically examining whether the derived scores seem to measure the construct as intended.2
Psychometricians call data measuring a single construct unidimensional and data measuring multiple constructs multidimensional.1,3 Factor analysis offers a tool to empirically evaluate the dimensionality of question sets.3 Empirical fit indices allow investigators to evaluate the validity of different hypothesized measurement structures for a scale. Thus, one can use factor analyses to evaluate the apparent internal measurement structure of a scale, make decisions about unidimensionality or multidimensionality, and subsequently develop a valid scoring system based upon the structure.4 However, in practice, real data are often “messy.” They fail tests of unidimensionality but, simultaneously, the identified multidimensionality may not seem substantively meaningful.2,4–6
For example, a set of questions may measure a single construct (eg, cultural competence), but specific item content shared by item “clusters” (eg, a set of questions about a patient’s language) may result in one or more “specific” factors, violating unidimensionality. This may make the scale seem multidimensional when the scale essentially does measure a single construct. The multidimensionality reflects “specific” nuisance factors (also labeled “grouping” factors) with relatively limited substantive meaning. These can occur for several reasons, among them context effects and item location. For example, a patient safety culture survey may ask several questions specifically related to workgroups and several specific to the broader institution. Each set will include common wording related to the workgroup or institution, and this can lead to specific factors.
When specific factors occur in the presence of a single general factor, one can more validly create a single score from the entire item set.2,4–6 This is because the data seem sufficiently unidimensional.5,7,8 From a scoring perspective, then, when data fail unidimensionality tests the question becomes: does the scale “essentially” measure one construct (and one or more nuisance factors) or does it measure multiple meaningful constructs? If it sufficiently measures one construct, this provides evidence for the validity of a single score. If it measures multiple meaningful constructs, one has evidence in favor of multiple scores rather than a single score.
To create valid scoring systems, empirical analyses should always address whether a scale’s measurement structure is consistent with a single construct, multiple constructs, or single construct with nuisance factors. Three general types of factor models correspond to these structures: pure unidimensional models, multidimensional models, and bifactor models, respectively. One can use factor analyses to examine the extent to which these models are consistent with the data. Unlike unidimensional and multidimensional models, bifactor models are not currently well known.
The bifactor model9 explicitly postulates the presence of a general construct (factor) and the presence of one or more nuisance factors (Fig. 1). The bifactor model does not allow any of the factors to correlate. This specification allows one to unambiguously interpret scores on the general factor uninfluenced by the nuisance factors. Thus, when a test of pure unidimensionality fails and a bifactor model fits well, the item set’s data structure is consistent with a single construct and scores from the entire item set measure that construct. One can also draw this conclusion even when a multidimensional model fits the data well, if the bifactor model fits as well as or better than the multidimensional model.
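The structure described above can be illustrated with a minimal sketch, assuming standardized items, unit-variance factors, and exactly one specific factor per item. The function name and all loading values are hypothetical and chosen only for illustration; they are not estimates from the CAHPS-CC data. Because no factors correlate, the model-implied correlation between two items is the product of their general loadings, plus the product of their specific loadings only when they share a specific factor.

```python
def bifactor_implied_correlation(general, specific, groups):
    """Model-implied item correlations under an orthogonal bifactor model.

    general  : standardized loading of each item on the general factor
    specific : standardized loading of each item on its one specific factor
    groups   : label of the specific (nuisance) factor each item belongs to
    All three arguments are hypothetical inputs of equal length.
    """
    n = len(general)
    for g, s in zip(general, specific):
        if g * g + s * s > 1.0:
            raise ValueError("an item's squared loadings sum to more than 1")

    implied = [[1.0] * n for _ in range(n)]  # standardized items: unit diagonal
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Every pair of items shares the general factor.
            shared = general[i] * general[j]
            # Only items in the same cluster also share a specific factor.
            if groups[i] == groups[j]:
                shared += specific[i] * specific[j]
            implied[i][j] = shared
    return implied


# Illustrative values only: four items in two clusters.
general = [0.6, 0.6, 0.6, 0.6]
specific = [0.4, 0.4, 0.3, 0.3]
groups = [0, 0, 1, 1]
implied = bifactor_implied_correlation(general, specific, groups)
```

In this toy example, items from different clusters correlate only through the general factor, whereas items within a cluster correlate somewhat more strongly because of their shared nuisance factor.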
When a test of pure unidimensionality fails and a bifactor model either fails to fit the data well or a multidimensional model fits the data better than the bifactor model, this provides evidence against a single score and in favor of separate scores for each construct. As noted above, the multidimensional model does not specify a general factor (Fig. 2). It specifies individual, meaningful, and potentially correlated factors measured by specific item sets. A well-fitting multidimensional model (absolutely and comparatively) provides evidence that the individual factors are not nuisance factors but rather correspond to meaningful constructs. In that case, one should create scale scores for each item set.
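For contrast, the multidimensional (correlated factors) structure can be sketched in the same way, again assuming standardized items and unit-variance factors. Here there is no general factor: each item loads on one substantive factor, and items on different factors relate only through the factor correlation. The function name, loadings, and the factor correlation of -0.5 are hypothetical illustrations, not values from this study.

```python
def correlated_factors_implied_correlation(loadings, factor_of_item, factor_corr):
    """Model-implied item correlations under a correlated-factors model.

    loadings       : standardized loading of each item on its single factor
    factor_of_item : index of the factor each item loads on
    factor_corr    : matrix of correlations among the factors (unit diagonal)
    All inputs are hypothetical and used for illustration only.
    """
    n = len(loadings)
    implied = [[1.0] * n for _ in range(n)]  # standardized items: unit diagonal
    for i in range(n):
        for j in range(n):
            if i != j:
                # Items relate through the correlation between their factors
                # (which is 1.0 when both items load on the same factor).
                phi = factor_corr[factor_of_item[i]][factor_of_item[j]]
                implied[i][j] = loadings[i] * loadings[j] * phi
    return implied


# Illustrative values only: two factors that correlate negatively.
loadings = [0.7, 0.7, 0.8, 0.8]
factor_of_item = [0, 0, 1, 1]
factor_corr = [[1.0, -0.5], [-0.5, 1.0]]
implied = correlated_factors_implied_correlation(loadings, factor_of_item, factor_corr)
```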
Few published medical literature–based examples have taken an applied rather than mathematical approach to describing the factor models and their implications generally (as above). This is unfortunate because (when appropriate) the bifactor model can resolve potentially thorny dimensionality issues2,4–6,10 and subsequent difficulty regarding the interpretability of measures used in health and health-related outcomes measurement. The recent development of the Consumer Assessments of Healthcare Providers and Systems Cultural Competence Survey (CAHPS-CC) and the need to interpret patients’ reports using the survey provides an opportunity to present an applied example of the models described above.
Culturally competent care refers to the capacity of health care providers at various levels to engage with patients in a safe, patient- and family-centered, evidence-based, and equitable manner.11 In response to the need for a CAHPS-based measure of cultural competence, the CAHPS-CC team set out to develop the CAHPS-CC. The team started with the theoretical framework described by Ngo-Metzger et al.12 This theoretically derived framework describes 5 domains of cultural competence: patient-provider communication; respect for patient preferences/shared decision-making; experiences leading to trust or distrust; experiences of discrimination; and language services.12 As described in detail elsewhere,13 development followed several steps to create items to measure these 5 domains: (1) evaluating existing CAHPS surveys to identify existing items that addressed the domains of interest; (2) conducting a literature review to identify existing instruments or item sets previously used to collect data on cultural competence from the patient’s perspective; (3) placing a Federal Register notice with a call for measures; (4) reviewing and adapting existing measures in the public domain; (5) writing new survey items for each of the domains not addressed in steps 1 through 4; and (6) conducting analyses to evaluate the scale’s dimensionality, given that no previous analyses of this scale or its framework had occurred. At the end of item development, the Cultural Competence Survey included 27 questions meant to determine whether an experience occurred (as opposed to evaluating the experience). Readers can access the entire survey at: https://www.cahps.ahrq.gov/clinician_group/. The findings we report below detail the analyses that took place at step 6.
Participants came from a field test of the CAHPS-CC survey conducted in 2008 among a stratified random sample (based on race/ethnicity and language) of 6000 adult (18 y and older) Medicaid managed care enrollees in 2 health plans, one in New York (3200) and the other in California (2800). We selected New York and California for the field test given the diversity of their respective populations. The initial sample consisted of: 1200 white English speakers, 1200 black English speakers, 900 Hispanic English speakers, 900 Hispanic Spanish speakers, 900 Asian English speakers, and 900 Asian non-English speakers. The survey consisted of a 2-wave mailing with follow-up telephone interviews of nonrespondents. We offered a monetary incentive to nonresponders remaining after the second call attempt. This multipronged approach resulted in a 26% response rate (n=1380). Analyses presented elsewhere13 showed that respondents and nonrespondents differed on only a select few of the variables examined. Respondents were more likely to be white and older and less likely to be black. After excluding individuals who did not have a personal doctor or a doctor visit during the last 12 months, the final analytic sample comprised 991 respondents.13
To develop multidimensional and bifactor models, we used the structural equation modeling approach described by Reise et al4 to test the fit of specified models. This approach allows one to comparatively evaluate the fit of different models, selecting the best fitting model as the most valid. We examined the fit of each specified model (including the unidimensional model) using empirically validated fit indices and levels suggested by Hu and Bentler14,15: root mean square error of approximation (RMSEA) values <0.05, and comparative fit index (CFI) and Tucker-Lewis Index (TLI) values >0.95. After identifying a model that fit the data acceptably, we used the relevant index (or indices depending on the model) suggested by Reise et al,5 Bentler,16 and Revelle and Zinbarg17 to describe and compare the proportion of variance each model (and/or its parts if applicable) accounted for. These indices included ωt, ωh, and explained common variance.5,16,17 ωt describes the total proportion of variance accounted for by the model. ωh describes the total proportion of variance accounted for by a hierarchical model’s general factor (eg, the general factor in a bifactor model). Explained common variance describes the percentage of variance that a hierarchical (eg, bifactor) model’s general factor accounts for relative to the variance accounted for by the entire model (eg, the amount of variance explained by the general and specific factors). All analyses used Mplus (6.1)18 with its theta parameterization, robust weighted least squares estimator, and missing data estimation capability to estimate the means and covariances (rather than item-level imputation).19
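The three variance indices named above can be illustrated with a minimal sketch. It uses the commonly cited formulas for a unit-weighted total score under an orthogonal bifactor model with standardized loadings; this is an assumption for illustration and not necessarily the exact computation used in these analyses. The function name and all loading values are hypothetical.

```python
def bifactor_variance_indices(general, specific, groups):
    """Return (omega_total, omega_hierarchical, ecv) from standardized loadings.

    general  : loading of each item on the general factor
    specific : loading of each item on its one specific factor
    groups   : label of the specific factor each item belongs to
    """
    # Uniqueness of each standardized item: variance not explained by any factor.
    uniqueness = [1.0 - g * g - s * s for g, s in zip(general, specific)]
    if min(uniqueness) < 0:
        raise ValueError("an item's squared loadings sum to more than 1")

    # Variance in the unit-weighted total score due to the general factor.
    general_part = sum(general) ** 2

    # Variance due to specific factors: sum loadings within each factor, then square.
    sums_by_group = {}
    for s, k in zip(specific, groups):
        sums_by_group[k] = sums_by_group.get(k, 0.0) + s
    specific_part = sum(total ** 2 for total in sums_by_group.values())

    total_variance = general_part + specific_part + sum(uniqueness)
    omega_total = (general_part + specific_part) / total_variance
    omega_hierarchical = general_part / total_variance

    # Explained common variance: general factor's share of all common variance.
    general_common = sum(g * g for g in general)
    specific_common = sum(s * s for s in specific)
    ecv = general_common / (general_common + specific_common)
    return omega_total, omega_hierarchical, ecv


# Illustrative values only: four items in two clusters.
omega_t, omega_h, ecv = bifactor_variance_indices(
    general=[0.6, 0.6, 0.6, 0.6],
    specific=[0.4, 0.4, 0.3, 0.3],
    groups=[0, 0, 1, 1],
)
```

By construction ωh cannot exceed ωt, and a large gap between them suggests that the specific factors contribute a meaningful share of the reliable variance in the total score.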
To test whether responses to the CAHPS-CC appeared to measure a single cultural competence construct, we first tested a unidimensional model’s fit. In this model, each item loaded only on a single factor (ie, cultural competence). For statistical identification, we set the factor mean at zero and the factor variance at one. We allowed no correlations among the uniquenesses. This model did not fit well (RMSEA=0.11; TLI=0.88; CFI=0.89), indicating that a single cultural competence factor did not underlie the data. Subsequently, we sought to develop and comparatively examine the fit of multidimensional and bifactor models.
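As a small sketch of how the stated cutoffs apply, the following compares the unidimensional model’s reported fit indices with the thresholds given in the Methods (RMSEA <0.05; TLI and CFI >0.95). The function name and its parameter names are hypothetical; the index values are those reported above.

```python
def check_fit(rmsea, tli, cfi, rmsea_max=0.05, incremental_min=0.95):
    """Compare each fit index with its cutoff; True means the cutoff is met."""
    return {
        "rmsea": rmsea < rmsea_max,      # smaller RMSEA indicates better fit
        "tli": tli > incremental_min,    # larger TLI indicates better fit
        "cfi": cfi > incremental_min,    # larger CFI indicates better fit
    }


# Fit indices reported for the unidimensional model.
checks = check_fit(rmsea=0.11, tli=0.88, cfi=0.89)
acceptable = all(checks.values())
```

Here no index meets its cutoff, which matches the conclusion that the unidimensional model did not fit well.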
We first fit a model that corresponded to the 5 domains identified during item development. This model did not fit well (RMSEA=0.10; TLI=0.90; CFI=0.89). We then conducted an exploratory factor analysis (EFA) and examined the extent to which the EFA’s results diverged from the theoretical model. We used this comparison to empirically modify the theoretical model and arrive at a model consistent with the data. This allowed us to maintain the theoretical model as much as possible.
The EFA suggested splitting the original “Communication” factor into “Positive,” “Negative,” “Alternative-Medicine,” and “Preventive-Care” (labeled using item content) might result in acceptable fit. To minimize modifications, we first split communication into positive and negative factors only. This resulted in improved but not acceptable fit (RMSEA=0.091; TLI=0.91; CFI=0.90). We then split communication into: positive, negative, and alternative-medicine communication. This further improved fit, but not sufficiently (RMSEA=0.090; TLI=0.91; CFI=0.92). We then split communication into: positive, negative, alternative-medicine, and preventive-care communication. This model fit the data well (RMSEA=0.064; TLI=0.98; CFI=0.97). Table 1 details this model. It included 7 factors labeled: Doctor Communication—Positive Behaviors; Doctor Communication—Negative Behaviors; Doctor Communication—Preventive Care; Doctor Communication—Alternative Medicine; Shared Decision Making; Equitable Treatment; and Trust. We did not modify the model further (ω t=0.69).
However, fit indices should not solely guide model selection; a model’s results should also make conceptual sense. Thus, we also evaluated whether the factor correlations made conceptual sense. Consider “Communication—Positive” and “Trust.” For Communication—Positive, the response options lead to a factor where high scores indicate “good” communication. Conversely, for Trust, the response options lead to a factor where high scores indicate “poor” trust. Sensibly, as good communication increases (ie, as Communication—Positive values increase), trust ought to increase (ie, Trust factor values decrease). As Table 1 shows, these 2 factors did correlate negatively. The correlations among the remaining factors were also conceptually sensible, indicating that the model not only fit well but was also substantively interpretable.
To develop a bifactor model, we first attempted to fit a bifactor model that reflected the multidimensional model described above. This model had one general “cultural competence” factor (on which all items loaded) and specific factors for each of the individual factors described above. Items loaded on only 1 specific factor and none of the factors correlated. This model failed to converge. We then attempted bifactor models based on the original theoretical model and the theoretically based empirically modified multidimensional model. All of these models had a general cultural competence factor on which all items loaded. They differed only in the specification of the uncorrelated specific factors. For example, we first specified a bifactor model consistent with the theoretical framework used in scale development. Another model combined each doctor communication factor into a single specific factor, but did not change the remaining specific factors, while another combined only the positive and negative communication factors. None of the theoretically based bifactor models we attempted converged.
We next attempted empirically driven bifactor model development. At a reviewer’s suggestion, we conducted an EFA on the single factor model’s residuals to guide empirically driven subdomain identification. The EFA suggested only a single specific factor. We specified a model with a single grouping factor. This model failed to converge. EFAs with additional factors failed to converge, precluding using EFA to identify multiple grouping factors.
Also at the reviewer’s suggestion, we considered models derived from the single factor model’s modification indices. A model with 1 residual covariance based on the largest single-factor model modification indices failed to converge. A model specifying 2 specific factors did not adequately fit the data (RMSEA=0.093; TLI=0.91; CFI=0.90). A model specifying 3 specific factors also did not fit well (RMSEA=0.09; TLI=0.91; CFI=0.91). Given the number of bifactor models attempted and our concern that additional model modifications would potentially capitalize too much on chance, we did not attempt additional bifactor models. Given the failure of several bifactor models to converge and the fit of the bifactor models that did, we concluded the data were inconsistent with a bifactor model.
Using confirmatory unidimensional, multidimensional, and bifactor models, we sought to address whether the CAHPS-CC’s measurement structure seems most consistent with a model specifying a general cultural competence construct or a model specifying multiple cultural competence constructs. Our results provide preliminary evidence that it measures 7 constructs: Doctor Communication—Positive Behaviors; Doctor Communication—Negative Behaviors; Doctor Communication—Preventive Care; Doctor Communication—Alternative Medicine; Shared Decision Making; Equitable Treatment; and Trust. Our results provide preliminary evidence that investigators should use the CAHPS-CC to form scales for each construct and not derive an overall cultural competence score from this item set.
Our results demonstrate the need for researchers to empirically examine the apparent dimensionality of data resulting from scales. Despite best efforts, a set of questions may not seem to measure a construct (or constructs) as expected.20 Scoring systems following theoretical rather than empirical measurement structure (when they differ) may result in spurious and invalid conclusions. In our data, one can see that neither the initially hypothesized structure of 5 domains nor a single cultural competence construct appeared to describe question responses well. Rather, our findings suggested 7 domains of cultural competence measured by 7 scale scores. Given the potential for real data to fail to correspond to theoretical expectations, investigators should always evaluate dimensionality.
The interpretability and validity of scores derived from measurement instruments depends upon whether the questions included in a composite score measure a single coherent construct. If a scale is intended to measure a single construct, but models suggest a measurement structure consistent with multiple substantive constructs, one should not create a single summary score from the entire set of questions. This can lead to spurious results and invalid conclusions. Rather, one should create subscale scores for each of the constructs measured by the questions on the scale.3–6,9,10,20
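The scoring recommendation above can be sketched as follows. The item identifiers, subscale names, and the choice of a mean of answered items are all assumptions made for illustration; they are not the CAHPS-CC item assignments, and this paper does not prescribe how the scores themselves should be derived.

```python
from statistics import mean

# Hypothetical item-to-subscale map (placeholder names, not actual CAHPS-CC items).
SUBSCALES = {
    "communication_positive": ["q1", "q2", "q3"],
    "trust": ["q4", "q5"],
}


def subscale_scores(responses, subscales, min_answered=1):
    """Score each construct separately instead of pooling all items.

    responses    : dict mapping item id to a numeric response (None if missing)
    subscales    : dict mapping subscale name to its list of item ids
    min_answered : minimum answered items required to report a subscale score
    """
    scores = {}
    for name, items in subscales.items():
        answered = [responses[i] for i in items if responses.get(i) is not None]
        # Report a score only when enough items were answered; otherwise None.
        scores[name] = mean(answered) if len(answered) >= min_answered else None
    return scores


# One hypothetical respondent with a missing answer on q3.
respondent = {"q1": 4, "q2": 3, "q3": None, "q4": 2, "q5": 1}
scores = subscale_scores(respondent, SUBSCALES)
```

The design choice is simply that no overall total is computed: each construct receives its own score from its own item set.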
Before concluding, we note some limitations. First, we do not intend our paper as a “playbook.” Rather, we present it as an applied perspective on the importance of evaluating dimensionality and the use of bifactor and multidimensional factor models in that pursuit. Second, our paper has focused on whether or not individuals should create single or multiple scores from item sets. Although these analyses address a core question with respect to score development and interpretability, they do not address how to best derive the subsequent scores [eg, sum of observed responses, item response theory (IRT)-based methods, etc.]. Although sum scores can provide reasonable construct estimates, other methods (ie, IRT) generally provide more precise estimates. Third, our paper focuses on establishing dimensionality rather than IRT scaling. Thus, we do not address that even if a researcher selects a bifactor over a multidimensional model, they must still examine whether the presence of the nuisance factors biases IRT parameters.5 We also note that: our 2 state sample may not generalize to the entire population, Medicaid or otherwise; offering a monetary incentive to nonresponders only may have introduced sampling bias; and the low response rate limits generalizability. Finally, before making strong conclusions about a model’s general acceptability, researchers should attempt to replicate their findings in a separate sample. Though we did not have additional data in which to replicate our model, a separate team working with independent data did replicate our model.21
In conclusion, we took an applied approach and demonstrated confirmatory unidimensional, multidimensional, and bifactor modeling techniques as tools to evaluate dimensionality and guide scoring. Our analyses demonstrate the critical importance of empirically evaluating dimensionality. Without conducting these analyses, the validity of scores derived from item sets remains in doubt.
Adam Carle thanks Tara J. Carle and Lyla S. B. Carle whose thoughtful comments and unending support make his work possible.
1. McDonald RP. Test Theory: A Unified Treatment. Mahwah, NJ: Erlbaum; 1999
2. Carle AC, Cella D, Cai L, et al. Advancing PROMIS’s methodology: results of the Third Patient-Reported Outcomes Measurement Information System (PROMIS®) Psychometric Summit. Expert Rev Pharmacoeconomics Outcomes Res. 2011;11:677–684
3. Bollen K. Structural Equations With Latent Variables. New York, NY: Wiley; 1989
4. Reise SP, Ventura J, Keefe RSE, et al. Bifactor and item response theory analyses of interviewer report scales of cognitive impairment in schizophrenia. Psychol Assess. 2011;23:245–261
5. Reise S, Moore T, Haviland M. Bifactor models and rotations: exploring the extent to which multidimensional data yield univocal scale scores. J Pers Assess. 2010;92:544–559
6. Reise SP, Morizot J, Hays RD. The role of the bifactor model in resolving dimensionality issues in health outcomes measures. Qual Life Res. 2007;16(suppl 1):19–31
7. Lai J-S, Crane PK, Cella D. Factor analysis techniques for assessing sufficient unidimensionality of cancer related fatigue. Qual Life Res. 2006;15:1179–1190
8. Reeve BB, Hays RD, Bjorner JB, et al. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care. 2007;45(suppl 1):S22–S31
9. Holzinger KJ, Swineford F. The bi-factor method. Psychometrika. 1937;2:41–54
10. Reise S, Moore T, Maydeu-Olivares A. Target rotations and assessing the impact of model violations on the parameters of unidimensional item response theory models. Educ Psychol Meas. 2011;71:684–711
11. National Quality Forum. Endorsing a framework and preferred practices for measuring and reporting culturally competent care quality. Washington, DC; 2008
12. Ngo-Metzger Q, Telfair J, Sorkin D, et al. Cultural Competency and Quality of Care: Obtaining the Patient’s Perspective. New York, NY: Commonwealth Fund; 2006
13. Weech-Maldonado R, Carle AC, Weidmer B, et al. Assessing cultural competency from the patient’s perspective: the CAHPS Cultural Competency (CC) Item Set. Working paper: Department of Health Services Administration, University of Alabama at Birmingham; 2010
14. Hu L, Bentler P. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Structural Equation Modeling. 1999;6:1–55
15. Hu L, Bentler PM. Fit indices in covariance structure modeling: sensitivity to underparameterized model misspecification. Psychol Methods. 1998;3:424–453
16. Bentler PM. Alpha, dimension-free, and model-based internal consistency reliability. Psychometrika. 2009;74:137–143
17. Revelle W, Zinbarg RE. Coefficients alpha, beta, omega, and the glb: comments on Sijtsma. Psychometrika. 2009;74:145–154
18. Muthén LK, Muthén BO. Mplus User’s Guide. Los Angeles, CA: Muthén & Muthén; 2009
19. Little R, Rubin DB. Statistical Analysis With Missing Data. Vol. 2. New York: John Wiley; 2002
20. Carle AC, Blumberg SJ, Moore KA, et al. Advanced psychometric methods for developing and evaluating cut-point-based indicators. Child Indicators Res. 2011:1–26
21. Stern RJ, Fernandez A, Jacobs EA, et al. Advances in measuring culturally competent care: a confirmatory factor analysis of CAHPS-CC in a safety-net population. Med Care. 2012;50(suppl 2):S49–S55
© 2012 Lippincott Williams & Wilkins, Inc.