Compared with MG-confirmatory factor analytic models, MG-MIMIC models more fully address heterogeneity within and across groups because they simultaneously allow background variables to directly affect responses to items measuring the latent variable and let measurement bias due to background variables differ across the groups. In this way, MG-MIMIC models partial out background variables' effects on measurement and better allow analytical examinations of measurement bias across a variable (eg, race and ethnicity) with the background variables' effects removed. Additionally, once one has fit a model, one can use model-based estimates to compare the health of various groups, removing the systematic error that measurement bias introduces.
This framework allows investigators to better conduct CER and mitigate systematic measurement error's negative effects using model-based estimates. In the remainder of the manuscript, an MG-MIMIC analysis addressing alcohol abuse, one of the 14 priority CER conditions identified by the Agency for Healthcare Research and Quality, is described using data from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC).7 The study shows how systematic measurement error as a function of poverty status, educational attainment, and minority status can lead to erroneous conclusions about alcohol abuse, and how model-based estimates can mitigate this error and lead to more accurate conclusions. (For a detailed description of the steps and methods involved in conducting these analyses, interested readers should consult one of the several excellent treatments of item response theory, structural equation modeling, and MG-MIMIC.1–5,8)
Participants (n = 25,512; 16,480 non-Hispanic whites (hereafter white), 4139 non-Hispanic black/African-Americans (hereafter black/African-American), and 4893 Hispanics) were a subset of the publicly available 2001–2002 NESARC data designed and sponsored by the National Institute on Alcohol Abuse and Alcoholism. The original sample consisted of 43,093 individuals 18 years and older representing the noninstitutionalized adult US population. The complex, multistage design oversampled black/African-Americans, Hispanics, and adults aged 18 to 24 years. Sample weights adjust the data so that they represent the civilian noninstitutionalized US population.7 The current study included white, black/African-American, and Hispanic participants with complete data who reported alcohol consumption in the past 12 months. Small sample sizes precluded including other groups.
Alcohol abuse is a maladaptive alcohol use pattern that occurs in the absence of alcohol dependence and leads to significant impairment or distress. It is indicated by at least one of the following criteria: (1) continued use despite a social or interpersonal problem caused or exacerbated by the effects of drinking, (2) recurrent drinking in situations in which alcohol use is physically hazardous, (3) recurrent drinking resulting in a failure to fulfill major role obligations, or (4) recurrent alcohol-related legal problems.9 The NESARC's Alcohol Use Disorder and Associated Disabilities Interview Schedule-IV,10 which has demonstrated acceptable psychometric properties in the general population,11–17 uses 10 dichotomous items to operationalize these criteria (Table 1). All 10 items were used for the analyses.
Five options coded race: “American Indian and Alaska Native”; “Asian”; “Black or African American”; “Native Hawaiian and Other Pacific Islander”; and “White.” A single item allowed Hispanic self-identification. Individuals were considered white if they identified as both white and non-Hispanic, black/African-American if they identified as both black/African-American and non-Hispanic, and Hispanic if they self-identified as Hispanic.
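The coding rule above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the argument names, and the returned labels are assumptions made for the example and do not correspond to NESARC variable names.

```python
from typing import Optional


def code_race_ethnicity(race: str, hispanic: bool) -> Optional[str]:
    """Return the analytic group, or None for respondents outside the three groups.

    `race` is one of the five race options; `hispanic` is the separate
    Hispanic self-identification item. Both names are hypothetical.
    """
    if hispanic:
        # Anyone who self-identified as Hispanic is coded Hispanic, whatever the race option.
        return "Hispanic"
    if race == "White":
        return "white"
    if race == "Black or African American":
        return "black/African-American"
    # Remaining groups were not analyzed because of small sample sizes.
    return None
```

Under this sketch, a respondent who selects “White” and also self-identifies as Hispanic is assigned to the Hispanic group, which matches the rule stated above.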
Participants reported their total past 12 months' personal and family incomes. From this, the NESARC estimated household income. Individuals were coded as living in households at or below 200% of the 2001 US federal poverty level or not.
Participants indicated their highest grade or year of school completed. Individuals were coded as either having completed high school or not.
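The two dichotomous background variables can be derived as in the following minimal Python sketch. The function names are illustrative, the poverty line is supplied by the caller rather than hard-coded (it depends on household size), and treating 12 completed years as high school completion is an assumption made for the example.

```python
def at_or_below_200_percent_poverty(household_income: float, poverty_line: float) -> int:
    """Return 1 if household income is at or below 200% of the poverty line, else 0.

    `poverty_line` is the applicable federal poverty level for the household;
    the caller must look it up, since it is not given in the text.
    """
    return int(household_income <= 2.0 * poverty_line)


def completed_high_school(years_completed: int, high_school_years: int = 12) -> int:
    """Return 1 if the respondent completed high school, else 0.

    The 12-year cutoff is an assumption for illustration.
    """
    return int(years_completed >= high_school_years)


# Hypothetical respondent: income of 25,000 against an illustrative poverty line of 15,000.
poverty_flag = at_or_below_200_percent_poverty(25_000.0, 15_000.0)
education_flag = completed_high_school(11)
```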
Following Millsap and Yun-Tein's method,18 measurement invariance was examined using a series of hierarchically nested models. The analyses started with the least constrained cross-group model and added cross-group equivalence constraints on the measurement parameters in a stepwise fashion in later models. For example, after establishing the single factor alcohol abuse model's tenability across the groups, the invariance in the direct effects of poverty status and educational attainment across the groups was examined. The fit indices and levels identified by the literature19–22 were used to evaluate the tenability of the equivalence constraints at each step: root mean square error of approximation values less than 0.05, and comparative fit index and Tucker-Lewis Index values greater than 0.95, where fit refers to the model's ability to reproduce the covariance among the items. The χ2 difference test (Δχ2) was also used to examine the relative deterioration in model fit resulting from the constraints added at each step. After identifying bias using Δχ2, item-level comparisons were used to identify the source of the bias and modify the model. To do this, the fit of a model that constrained one “new” parameter to equality across the groups (eg, the loading for a single item) was compared with the fit of an otherwise identical model without the constraint. Constraints that led to significantly decreased fit identified measurement bias. Subsequently, these constraints were freed to develop a partial invariance model that directly modeled measurement bias and allowed for more accurate abuse estimates. All analyses used Mplus21 and its robust weighted least squares estimator. Additionally, the theta parameterization was used, which includes the residual variances as parameters in the model. The complex sampling design and design weights were appropriately incorporated in Mplus.22 Zero-weighting23 was used to create the subsample and simultaneously maintain sampling information. See Korn and Graubard23 or Carle24 for details regarding weighting.
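The fit criteria named above can be expressed as a simple check. The following Python sketch is illustrative only: the function name is an assumption, and the example values are the fit indices reported for Model 1 in the results.

```python
def meets_fit_criteria(rmsea: float, cfi: float, tli: float) -> bool:
    """Apply the cutoffs stated in the text: RMSEA < 0.05, CFI > 0.95, TLI > 0.95."""
    return rmsea < 0.05 and cfi > 0.95 and tli > 0.95


# Fit indices reported for Model 1.
model_1_acceptable = meets_fit_criteria(rmsea=0.021, cfi=0.98, tli=0.98)
```

A model failing any one of the three cutoffs would be treated as not tenable under these criteria; the cutoffs themselves are conventions drawn from the cited literature rather than strict rules.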
Evaluating Systematic Measurement Error
Given previous work,25–27 a single factor alcohol abuse model was initially tested (Model 1) across whites, black/African-Americans, and Hispanics. Model 1 allowed poverty status and educational attainment each to have direct effects on each of the items (within statistical identification limits) and allowed poverty status and educational attainment to correlate. For statistical identification, model 1 fixed the factor mean and variance at 0 and 1 for whites, while freely estimating the black/African-American and Hispanic means and variances. Additional statistical identification constraints required constraining all groups' item intercepts to zero, fixing the direct effect of poverty status and educational attainment on the “ride in car as passenger while drinking” item to zero in all groups, constraining the loading for the “ride” item to equality across the groups, constraining the threshold for the “ride” item to equality across the groups, and fixing the uniquenesses to one for all groups. Model 1 included no other constraints.
In models like these, one must impose statistical constraints to achieve a unique solution.18 Typically, one chooses a “reference” group (here whites) and sets this group's mean and variance to zero and one, respectively. This sets the latent variable's metric to standard normal. The additional cross-group constraints place the results for all groups in the same latent metric, so one can interpret the parameters relative to each other. The chosen statistical identification constraints represent the minimal constraints necessary. However, one could identify the model across groups using a different “anchor” item. If bias exists in the anchor item, this can influence the results. Thus, to examine the possibility that using the “ride” item as an anchor for cross-group statistical identification might influence the results, the entire set of analyses was repeated using each of the other items as the anchor. In none did the “ride” item exhibit bias. Likewise, with one minor exception, these iterations arrived at the same final model that is discussed below. This strongly supports the “ride” item as an anchor and minimizes concerns that analyses using different statistical identification would diverge from these. The results are available to interested readers on request.
Model 1 fit the data well (root mean square error of approximation = 0.021, comparative fit index = 0.98, Tucker-Lewis Index = 0.98, χ2 = 281.75, df = 39, n = 25,512, P < 0.01). Given good fit, model 2 was tested, which constrained the direct effects of poverty status and educational attainment to zero across all groups. These constraints led to statistically significant misfit (Δχ2 = 191.193, df = 24, n = 25,512, P < 0.01), indicating measurement bias as a function of poverty status and educational attainment. Space constraints preclude listing each set of differences here; however, Table 1 details the final results. Bolded values in the table highlight differences in the measurement parameters across the groups. For example, with respect to educational attainment, analyses revealed that, at the same level of alcohol abuse, more highly educated whites had a greater likelihood than less educated whites of endorsing that alcohol caused trouble with family/friends. As a converse example, more highly educated Hispanics were less likely to indicate that they entered harmful situations while drinking. Related to the direct effects of poverty and educational attainment, analyses also indicated that the direct effects of poverty status on the “fights” and “legal” items, as well as the direct effects of educational attainment on “drive while drinking” and “drive after drinking,” did not differ across whites and black/African-Americans. Model 2b relaxed the constraints causing misfit.
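To illustrate what a direct effect of a background variable means for item responses, the following Python sketch computes endorsement probabilities for a single dichotomous item. It assumes a probit response model with the item intercept fixed at zero and the residual variance fixed at one, consistent with the identification constraints described for Model 1; all parameter values are hypothetical and are not taken from Table 1.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def endorsement_probability(eta: float, loading: float, threshold: float,
                            direct_effect: float = 0.0, covariate: int = 0) -> float:
    """P(item endorsed | latent level eta, background variable).

    Assumed form: Phi(loading * eta + direct_effect * covariate - threshold).
    """
    return normal_cdf(loading * eta + direct_effect * covariate - threshold)


# Two hypothetical respondents at the SAME latent alcohol abuse level (eta = 0)
# who differ only on a dichotomous background variable (eg, educational attainment).
p_without = endorsement_probability(eta=0.0, loading=1.0, threshold=1.5,
                                    direct_effect=0.3, covariate=0)
p_with = endorsement_probability(eta=0.0, loading=1.0, threshold=1.5,
                                 direct_effect=0.3, covariate=1)
```

With a positive direct effect, `p_with` exceeds `p_without` even though both respondents have the same latent level. This is the pattern that constraining the direct effects to zero in model 2 rules out.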
Model 3 modified model 2b to constrain the loadings to equivalence across groups. This model examined whether the items related similarly to alcohol abuse across whites, black/African-Americans, and Hispanics, after accounting for systematic measurement error due to poverty status and educational attainment. Constraining the loadings resulted in statistically significant misfit (Δχ2 = 30.40, df = 14, n = 25,512, P < 0.01), indicating bias as a function of race/ethnicity. Analyses indicated that 3 equality constraints led to the misfit. As Table 1 shows, responses to the items provided less reliable measurement for the minority groups (ie, smaller loading values for minorities). Analyses also indicated that black/African-Americans' and Hispanics' values differed significantly from each other. Model 3b relaxed these constraints.
Model 4 modified model 3b to constrain the thresholds to equality across whites, black/African-Americans, and Hispanics. This model examined whether affirmative item endorsements had similar likelihoods across race and ethnicity. Constraining the thresholds resulted in statistically significant misfit (Δχ2 = 88.87, df = 13, n = 25,512, P < 0.01), indicating measurement bias. Analyses showed that 8 equality constraints led to misfit. Table 1 summarizes these differences. As one example, compared with whites at the same alcohol abuse level, black/African-Americans as well as Hispanics were more likely to endorse drinking while driving. Finally, analyses also indicated that black/African-American and Hispanic threshold values for the drinking while driving, driving after drinking, and entering harmful situations items did not differ from each other, though they did differ from those of whites. The final model relaxed these ill-fitting constraints.
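The P values attached to the Δχ2 tests for models 2, 3, and 4 can be checked with the following minimal Python sketch, which computes the upper-tail probability of a χ2 statistic for integer degrees of freedom using only the standard library. The function name is illustrative. The sketch only converts an already computed difference statistic into a P value; with a robust weighted least squares estimator the difference statistic generally cannot be obtained by subtracting two model χ2 values, so the reported values are taken as given.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variate with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-y) * sum_{k=0}^{m-1} y^k / k!
        term = math.exp(-y)
        total = term
        for k in range(1, df // 2):
            term *= y / k
            total += term
        return total
    # Odd df = 2m + 1: erfc(sqrt(y)) + exp(-y) * sum_{k=1}^{m} y^(k - 1/2) / Gamma(k + 1/2)
    total = math.erfc(math.sqrt(y))
    term = math.exp(-y) * math.sqrt(y) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= y / (k + 0.5)
    return total


# Difference statistics and degrees of freedom as reported in the text.
reported_tests = {
    "model 2 (direct effects)": (191.193, 24),
    "model 3 (loadings)": (30.40, 14),
    "model 4 (thresholds)": (88.87, 13),
}
p_values = {name: chi2_sf(stat, df) for name, (stat, df) in reported_tests.items()}
```

Each of the three resulting P values falls below 0.01, in agreement with the reported significance levels.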
In summary, analyses revealed statistically significant measurement bias across race and ethnicity, even after accounting for bias due to differential poverty status and educational attainment. The final model incorporated numerous direct effects of poverty status and educational attainment on alcohol abuse measurement and incorporated several differences in alcohol abuse measurement across whites, black/African-Americans, and Hispanics.
Mitigating Systematic Measurement Error
The presence of significant measurement bias across individuals of different income, educational, and race/ethnic backgrounds indicates that one should not use unadjusted scores to measure alcohol abuse. Rather, one should use model-based estimates of alcohol abuse levels to mitigate systematic error. To demonstrate the importance of using model-based estimates that mitigate systematic measurement error, the estimates that resulted from the final measurement model incorporating measurement differences were compared with estimates that resulted from a model ignoring systematic measurement error. Under the model ignoring measurement bias, whites served as the reference group and had a mean of zero (for statistical identification). Both black/African-Americans and Hispanics had significantly greater mean alcohol abuse levels (black/African-American: M = 0.43, z = 7.81; Hispanic: M = 0.173, z = 2.11). However, under the model mitigating systematic measurement error, black/African-Americans no longer differed significantly from whites (M = 0.136, z = 1.86) and Hispanics had significantly lower alcohol abuse levels (M = −0.261, z = −2.19).
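The significance statements above can be checked with a short calculation. The following Python sketch converts the reported z statistics into two-sided P values; the use of a two-sided test at the 0.05 level is an assumption, as the text does not state it, and the names used are illustrative.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# z statistics as reported in the text (reference group: whites).
reported_z = {
    "ignoring bias, black/African-American": 7.81,
    "ignoring bias, Hispanic": 2.11,
    "mitigating bias, black/African-American": 1.86,
    "mitigating bias, Hispanic": -2.19,
}
significant_at_05 = {name: two_sided_p(z) < 0.05 for name, z in reported_z.items()}
```

Under these assumptions, only the black/African-American comparison under the model mitigating measurement error fails to reach significance, which matches the pattern described above.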
In this study, the importance of empirically evaluating the presence of systematic measurement error in CER is shown. Additionally, the study was intended to demonstrate how systematic measurement error can influence analytic results and how model-based techniques can mitigate this error. The author also endeavored to briefly describe the mathematical and methodological tools used to conduct analyses probing for and correcting systematic measurement error. Using MG-MIMIC models, the current example showed that poverty status, educational attainment, and race and ethnicity all directly influenced alcohol abuse measurement. Without accounting for systematic measurement error due to these sources, one would conclude that Hispanics and black/African-Americans demonstrate significantly greater amounts of alcohol abuse behavior than whites. However, model-based estimates of alcohol abuse that corrected for measurement bias and mitigated systematic measurement error clarified that Hispanics demonstrate significantly lower amounts of alcohol abuse behavior in comparison to whites and that black/African-Americans do not differ significantly from whites in their alcohol abuse behavior. Without using model-based estimates, efforts to comparatively evaluate treatment effectiveness across these populations would result in flawed conclusions, especially in observational research.
These findings highlight that CER using self-reported measures in heterogeneous samples should consider whether group differences (or similarities) reflect true differences (or similarities) or whether they result from systematic measurement error. Additionally, these results indicate that systematic reviews based on research not accounting for systematic measurement error may reflect spurious findings. Until CER that uses self-report measures more effectively examines cross-group measurement reliability and validity equivalence for the measured variables, the validity of conclusions will remain clouded.
Despite this study's strengths, its limits deserve review. First, the Hispanic subgroups could not be explored given small subgroup sample sizes, so the analyses may miss additional heterogeneity. Second, a representative sample was used; it remains unclear whether these results would hold in a clinical sample. Finally, a demonstration within an actual CER study was not provided. Rather, the study focused on the statistical method for mitigating bias. The dataset used in this study did not lend itself to both examining bias and CER questions. While these concerns leave issues unaddressed, they do not impede the study's ability to serve as an example of the statistical method.
Finally, to extend the example given in the introduction to the current results, suppose the dataset used here included detailed information about different treatments that respondents had received. Then, rather than using the biased observed scores, one would use the model-based scores that adjusted for measurement bias to select individuals and to compare differences in alcohol abuse levels across treatments and subgroups. In this way, one would achieve more accurate estimates of differences in alcohol abuse across the heterogeneous individuals included in the study.
In summary, systematic measurement error (ie, measurement bias) may profoundly influence CER results. Despite equivalence in underlying values of a measured health construct, people may respond dissimilarly to questions about their health as a function of various background variables. Without testing for the presence of measurement bias, it will remain unclear whether treatment evaluations demonstrating treatment effectiveness (or failures) across a heterogeneous population reflect true differences or systematic measurement error. Importantly, MG-MIMIC models offer a tool to simultaneously investigate and mitigate systematic measurement error. Model-based estimates of the health construct corrected for systematic measurement error will lead to more valid treatment effectiveness comparisons across heterogeneous groups. Regardless of the specific fields in which investigators work, researchers should consider MG-MIMIC models as a tool to incorporate into the CER tool box.
1. Muthén BO. Latent variable modeling in heterogeneous populations. Psychometrika
2. Jones RN. Identification of measurement differences between English and Spanish language versions of the mini-mental state examination. Detecting differential item functioning using MIMIC modeling. Med Care
3. Jones RN. Racial bias in the assessment of cognitive functioning of older adults. Aging Ment Health
4. Bollen KA. Structural Equations With Latent Variables. New York, NY: John Wiley & Sons; 1989.
5. Muthén B. Contributions to factor analysis of dichotomous variables. Psychometrika
6. Muthén B. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika
7. Grant BF, Kaplan K, Shepard J, et al. Source and Accuracy Statement for Wave 1 of the 2001–2002 National Epidemiologic Survey on Alcohol and Related Conditions. Bethesda, MD: National Institute on Alcohol Abuse and Alcoholism; 2003.
8. Embretson S, Reise SP. Item Response Theory for Psychologists. Hillsdale, NJ: Lawrence Erlbaum Associates; 2000.
9. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. 4th ed. Washington, DC: American Psychiatric Association; 1994.
10. Grant BF, Dawson DA, Hasin DS. The Alcohol Use Disorder and Associated Disabilities Interview Schedule-DSM-IV Version (AUDADIS-IV). Bethesda, MD: National Institute on Alcohol Abuse and Alcoholism; 2001.
11. Grant BF. Convergent validity of DSM-III-R and DSM-IV alcohol dependence: results from the National Longitudinal Alcohol Epidemiologic Survey. J Subst Abuse
12. Grant BF. Theoretical and observed subtypes of DSM-IV alcohol abuse and dependence in a general population sample. Drug Alcohol Depend
13. Harford, Muthén BO. The dimensionality of alcohol abuse and dependence: a multivariate analysis of DSM-IV symptom items in the national longitudinal survey of youth. J Stud Alcohol
14. Grant BF, Harford TC, Dawson DD, et al. The Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS): reliability of alcohol and drug modules in a general population sample. Drug Alcohol Depend
15. Hasin DS, Grant B, Cottler L. Nosological comparisons of alcohol and drug diagnoses: a multisite, multi-instrument international study. Drug Alcohol Depend
16. Hasin D, Carpenter KM, McCloud S, et al. The Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS): reliability of alcohol and drug modules in a clinical sample. Drug Alcohol Depend
17. Hasin D, Paykin A. Alcohol dependence and abuse diagnoses: concurrent validity in a nationally representative sample. Alcohol Clin Exp Res
18. Millsap RE, Yun-Tein J. Assessing factorial invariance in ordered-categorical measures. J Multivar Behav Res
19. Hu L, Bentler PM. Fit indices in covariance structure modeling: sensitivity to underparameterized model misspecification. Psychol Methods
20. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Modeling
21. Muthén LK, Muthén BO. Mplus User's Guide. 4th ed. Los Angeles, CA: Muthén & Muthén; 1998–2007.
22. Steiger JH. A note on multiple sample extensions of the RMSEA fit index. Struct Equ Modeling
23. Korn E, Graubard BI. Estimating variance components by using survey data. J R Stat Soc Series B Stat Methodol
24. Carle AC. Fitting multilevel models in complex survey data with design weights: recommendations. BMC Med Res Methodol
25. Muthén BO, Hasin D, Wisnicki KS. Factor analysis of ICD-10 symptom items in the 1988 National Health Interview Survey on alcohol dependence. Addiction
26. Muthén BO. Factor analysis of alcohol abuse and dependence symptom items in the 1988 National Health Interview Survey. Addiction
27. Carle AC. Assessing the adequacy of self-reported alcohol abuse measurement across time and ethnicity: cross-cultural equivalence across Hispanics and Caucasians in 1992, non-equivalence in 2001–2002. BMC Public Health
Keywords: comparative effectiveness research; measurement bias; differential item functioning; cross-cultural differences
© 2010 Lippincott Williams & Wilkins, Inc.