Mitigating Systematic Measurement Error in Comparative Effectiveness Research in Heterogeneous Populations : Medical Care

Comparative Effectiveness

Carle, Adam C. PhD

doi: 10.1097/MLR.0b013e3181d59557


Accurately understanding treatment effectiveness across subpopulations requires equally reliable and valid measurement in each group. Measurement bias refers to the possibility that individuals with identical health respond dissimilarly to questions about their health as a function of their race, ethnicity, or other variable. For example, 2 people with equivalent drinking behaviors may respond to questions about their drinking differently due to culturally divergent beliefs about drinking's social acceptability. One may feel socially comfortable discussing their drinking, describing it fully, while another may feel uncomfortable discussing their drinking and report less drinking than actually occurred. Consequently, despite equivalent underlying pathology, the 2 individuals would appear dissimilar based on their question responses. As a result, comparative effectiveness research (CER) seeking to classify individuals based on their responses to questions about their health and/or evaluate treatment outcomes based on individuals' responses, would include systematic flaws.

For example, suppose an investigator seeks to compare the effectiveness of different alcohol abuse treatments in a sample that will include individuals of different racial and ethnic backgrounds. Moreover, suppose the investigator also intends to compare these treatments' effectiveness across whites, black/African-Americans, and Hispanics. Finally, suppose the investigator will use a self-report measure to select individuals with above average alcohol abuse behaviors (ie, problematic levels) and subsequently compare levels of alcohol abuse behaviors across treatments and across race and ethnicity within treatments to compare treatment effectiveness. If identical observed responses actually represent different behavior levels among whites, black/African-Americans, and Hispanics, the investigator will erroneously select individuals (eg, include individuals in the problematic group who should not be included) and erroneously draw conclusions about differences in levels of abuse behaviors across whites, black/African-Americans, and Hispanics. The investigator needs a method of evaluating and mitigating this possible source of error.

Modern measurement theory offers model-based methods to test for measurement bias. In addition to investigating bias, these methods can correct for bias, allowing more valid comparisons across heterogeneous groups. Bias can obscure differences, decrease reliability and validity, and may (but does not always) render group comparisons based on the observed responses impossible. Without establishing equivalent measurement (or mitigating nonequivalent measurement's effects), the field cannot: (1) comparatively evaluate what works best for whom, (2) draw strong conclusions about disparate outcomes, (3) support evidence-based practice and policy, and, (4) address health disparities. Unfortunately, the CER literature has not discussed model-based methods for testing and mitigating measurement bias. Additionally, few researchers receive training in these model-based approaches, one of which is described here.


Multiple-group (MG), multiple-indicator, multiple-cause (MIMIC) models offer a potent method to investigate the potentially confounding effects of several background variables simultaneously (eg, race, ethnicity, poverty status, and educational attainment).1–4 MG-MIMIC models build on structural equation modeling and item response theory to extend “traditional” MG-confirmatory factor analytic models by incorporating additional background variables as covariates in a structural equation measurement model.1–3 Rather than limiting analyses to a single variable as traditional approaches do, the MG-MIMIC approach simultaneously controls for differences in responses due to some variables (eg, education and poverty status) and allows an investigation of measurement bias across another (eg, race and ethnicity).2

MG-MIMIC models develop mathematical models to describe individuals' responses to questions. Equations describe the relations among item responses and provide a canvas against which to test measurement bias hypotheses. Specifically, let $X_{ij}$ equal the ith individual's score on the jth ordered-categorical item (question), let the number of items equal p (j = 1, 2, ..., p), and let the item responses range over (0, 1, ..., s). For simplicity, consider a dichotomous item (ie, responses 0 or 1). The model assumes that a latent response variate, $X_{ij}^{*}$, determines item responses. The latent response variate corresponds to the idea that, although observed item responses fall into discrete categories (eg, no/yes), an underlying continuum represents the possible item responses. A threshold value on the latent response variate determines responses. If an individual's value on the latent response variate is less than the threshold, the individual will not endorse the item (ie, will say “no”), but if their value on the latent response variate is greater than the threshold, the individual will endorse the item. Formally:

$$X_{ij} = \begin{cases} 0 & \text{if } X_{ij}^{*} \le \nu_{j1} \\ 1 & \text{if } X_{ij}^{*} > \nu_{j1} \end{cases} \tag{1}$$

where $\nu_{j1}$ is the latent threshold parameter for the jth dichotomous item.
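
The threshold rule can be illustrated with a small sketch. It assumes, purely for illustration, that the latent response variate follows a standard normal distribution; the function names and threshold values are hypothetical and do not come from the study.

```python
import math


def observed_response(latent_value: float, threshold: float) -> int:
    """Dichotomize a latent response: 1 (endorse) only above the threshold."""
    return 1 if latent_value > threshold else 0


def endorsement_probability(threshold: float) -> float:
    """P(latent response > threshold), assuming a standard normal latent response."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


# Hypothetical thresholds for the same item in two groups.
for group, threshold in {"group A": 0.5, "group B": 1.0}.items():
    print(group, round(endorsement_probability(threshold), 3))
```

With identical latent distributions, the group with the higher threshold endorses the item less often, which is how a threshold difference can masquerade as a difference in health.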

After defining the thresholds and latent response variates, the model further supposes that some factor(s), $\xi$, is responsible for responses and relates $X_{ij}^{*}$ to the factor(s) as follows:

$$X_{ij}^{*} = \tau_j + \lambda_j' \xi_i + \varepsilon_{ij} \tag{2}$$

$\tau_j$ is a latent intercept parameter, $\lambda_j$ is an r × 1 vector of factor loadings for the jth variable on r factors, $\xi_i$ is the r × 1 vector of factor scores for the ith person, and $\varepsilon_{ij}$ is the jth unique factor score for that person. The loadings, similar to correlations, represent the degree to which an item relates to the factor(s); the greater the value of the factor loading, the greater the relation between the item and the latent variable. Intercept parameters give the expected value of an item when the value of the underlying factor(s) is zero. The unique factors include sources of variance not attributable to the factor(s), including measurement error.5,6
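
As a worked step, the threshold rule and the factor model combine into an endorsement probability. The sketch below assumes, as is common for probit-type models but not stated explicitly in the text, that the unique factor is normally distributed with mean zero and variance $\theta_j$. Here $\Phi$ denotes the standard normal distribution function, and $\theta_j$ is notation introduced only for this illustration.

```latex
\begin{align*}
P(X_{ij} = 1 \mid \xi_i)
  &= P(X_{ij}^{*} > \nu_{j1} \mid \xi_i) \\
  &= P(\varepsilon_{ij} > \nu_{j1} - \tau_j - \lambda_j' \xi_i) \\
  &= \Phi\!\left( \frac{\tau_j + \lambda_j' \xi_i - \nu_{j1}}{\sqrt{\theta_j}} \right)
\end{align*}
```

Under this assumption, a larger loading makes the endorsement probability rise more steeply with the factor, and a larger threshold lowers it at every factor level.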

Through 2 equations, MG-MIMIC models expand Eq. (2) to include background covariate(s), $x_i$, that can directly influence the latent variable's measurement and the latent variable itself. The first allows the covariate to directly influence the measurement of the latent trait:

$$X_{ij}^{*} = \tau_j + \lambda_j' \xi_i + \kappa_j x_i + \varepsilon_{ij} \tag{3}$$

The second, a structural equation, allows the covariate to predict the latent variable:

$$\xi_i = \alpha + \gamma x_i + \zeta_i \tag{4}$$

$\kappa_j$ captures the covariate's direct effect on the jth item, $\alpha$ describes the latent trait's mean value, $\zeta$ indicates residuals in the structural model, and $\gamma$ captures the covariate's influence on the latent variable. To investigate measurement bias, one subscripts measurement parameters to allow for group differences (eg, $X_{ijg}^{*} = \tau_{jg} + \lambda_{jg}' \xi_{ig} + \kappa_{jg} x_{ig} + \varepsilon_{ijg}$). Then, one constrains some or all of the measurement parameters to equality across groups and tests the constrained model's fit compared with a less constrained model. If model fit indices indicate the constraints' acceptability, measurement equivalence exists. If not, measurement bias is present. Figure 1 presents an MG-MIMIC model. The modeling process essentially examines whether one can assign equivalent values to each measurement path in the figure. If so, measurement equivalence exists. If not, measurement bias is present.
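
A minimal numerical sketch of the direct-effect idea follows. It assumes a single factor and a standard normal unique factor; every parameter value and function name is hypothetical, chosen only to show that a nonzero direct effect of the covariate changes the endorsement probability for people at the same latent level.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def endorsement_probability(xi, x, tau=0.0, lam=1.0, kappa=0.0, nu=0.5):
    """P(item endorsed | latent level xi, covariate x).

    Latent response = tau + lam * xi + kappa * x + unique factor, where the
    unique factor is ASSUMED standard normal. The item is endorsed when the
    latent response exceeds the threshold nu.
    """
    return normal_cdf(tau + lam * xi + kappa * x - nu)


same_level = 0.0  # two people with identical latent alcohol abuse levels
unbiased = [endorsement_probability(same_level, x, kappa=0.0) for x in (0, 1)]
biased = [endorsement_probability(same_level, x, kappa=0.4) for x in (0, 1)]
print("no direct effect:  ", [round(p, 3) for p in unbiased])
print("direct effect 0.4: ", [round(p, 3) for p in biased])
```

When the direct effect is zero the two probabilities coincide; when it is not, the covariate shifts responses even though the latent level is the same, which is the pattern the equality constraints are designed to detect.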

Figure 1. A hypothetical multiple-group multiple-indicator multiple-cause (MG-MIMIC) model for 2 groups, a single covariate, and a single latent variable measured by 5 questions. The arrow from the covariate to the latent variable represents the effect of the covariate on the latent variable. The arrow from the covariate to a latent response variate indicates that the covariate can directly influence the latent variable's measurement (theoretically an arrow could go from the large square to any of the small circles for a given group). It represents measurement bias due to the covariate. The arrows from the latent variable to the latent response variates indicate that the latent response variates measure the latent variable. The arrows from the latent response variates to the observed responses denote the relationship between the continuous latent response variates and their corresponding discrete item response options. The short arrows pointing directly at the latent response variates indicate unique sources of variance unattributable to the latent variable. To investigate measurement bias attributable to the grouping variable, one would examine the equivalence of the measurement paths across the groups.

Compared with MG-confirmatory factor analytic models, MG-MIMIC models more fully address heterogeneity within and across groups because they simultaneously allow background variables to directly affect responses to items measuring the latent variable and let measurement bias due to background variables differ across the groups. In this way, MG-MIMIC models partial out background variables' effects on measurement and better allow analytical examinations of measurement bias across a variable (eg, race and ethnicity) with the background variables' effects removed. Additionally, once one has fit a model, one can use model-based estimates to compare the health of various groups, removing the systematic error that measurement bias introduces.

This framework allows investigators to better conduct CER and mitigate systematic measurement error's negative effects using model-based estimates. Using data from the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC),7 the remainder of the manuscript describes an MG-MIMIC analysis addressing alcohol abuse, one of the 14 priority CER conditions identified by the Agency for Healthcare Research and Quality. The study shows how systematic measurement error as a function of poverty status, educational attainment, and minority status can lead to erroneous conclusions about alcohol abuse, and how model-based estimates can mitigate this error and lead to more accurate conclusions. (For a detailed description of the steps and methods involved in conducting these analyses, interested readers should consult one of the several excellent treatments of item response theory, structural equation modeling, and MG-MIMIC models.)1–5,8



Participants (n = 25,512; 16,480 non-Hispanic whites [hereafter white], 4139 non-Hispanic black/African-Americans [hereafter black/African-American], and 4893 Hispanics) were a subset of the publicly available 2001–2002 NESARC data designed and sponsored by the National Institute on Alcohol Abuse and Alcoholism. The original sample consisted of 43,093 individuals 18 years and older representing the noninstitutionalized adult US population. The complex, multistage design oversampled black/African-Americans, Hispanics, and adults aged 18 to 24 years. Sample weights adjust the data to make them represent the civilian noninstitutionalized US population.7 The current study included white, black/African-American, and Hispanic participants with complete data who reported alcohol consumption in the past 12 months. Small sample sizes precluded including other groups.


Alcohol Abuse

Alcohol abuse is a maladaptive alcohol use pattern that occurs in the absence of alcohol dependence and leads to significant impairment or distress. It demonstrates at least one of the following criteria: (1) continued use despite a social or interpersonal problem caused or exacerbated by the effects of drinking, (2) recurrent drinking in situations in which alcohol use is physically hazardous, (3) recurrent drinking resulting in a failure to fulfill major role obligations, or (4) recurrent alcohol-related legal problems.9 The NESARC's Alcohol Use Disorder and Associated Disabilities Interview Schedule-IV,10 which has demonstrated acceptable psychometric properties in the general population,11–17 uses 10 dichotomous items to operationalize these criteria (Table 1). All 10 items were used for the analyses.

Table 1. Final Multiple Group-MIMIC Model Results


Five options coded race: “American Indian and Alaska Native”; “Asian”; “Black or African American”; “Native Hawaiian and Other Pacific Islander”; and “White.” A single item allowed Hispanic self-identification. Individuals were classified as white if they identified as both white and non-Hispanic, as black/African-American if they identified as both black/African-American and non-Hispanic, and as Hispanic if they self-identified as Hispanic.

Poverty Status

Participants reported their total personal and family incomes for the past 12 months. From this, the NESARC estimated household income. Individuals were coded as either living in a household at or below 200% of the 2001 US federal poverty level or not.

Educational Attainment

Participants indicated their highest grade or year of school completed. Individuals were coded as either having completed high school or not.


Following Millsap and Yun-Tein's method,18 measurement invariance was examined using a series of hierarchically nested models. The analyses started with the least constrained cross-group model and added cross-group equivalence constraints on the measurement parameters in a stepwise fashion in later models. For example, after establishing the single factor alcohol abuse model's tenability across the groups, the invariance of the direct effects of poverty status and educational attainment across the groups was examined. The fit indices and cutoff levels identified by the literature19–22 were used to evaluate the tenability of the equivalence constraints at each step: root mean square error of approximation values less than 0.05, and comparative fit index and Tucker-Lewis Index values greater than 0.95, where fit refers to the model's ability to reproduce the covariance among the items. The χ2 difference test (Δχ2) was also used to examine the relative deterioration in model fit resulting from the constraints added at each step. After identifying bias using Δχ2, item-level comparisons were used to identify the source of the bias and modify the model. To do this, the fit of a model that constrained one “new” parameter to equality across the groups (eg, the loading for a single item) was compared with the fit of an otherwise identical model without the constraint. Constraints that led to significantly decreased fit identified measurement bias. Subsequently, these constraints were freed to develop a partial invariance model that directly modeled measurement bias and allowed for more accurate abuse estimates. All analyses used Mplus21 and its robust weighted least squares estimator. Additionally, the theta parameterization was used, which includes the residual variances as parameters in the model. The complex sampling design and design weights were appropriately incorporated in Mplus.22 Zero-weighting23 was used to create the subsample and simultaneously maintain sampling information. See Korn and Graubard23 or Carle24 for details regarding weighting.
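
To make the Δχ2 decision rule concrete, the sketch below converts a Δχ2 value and its degrees of freedom into a P value using an ordinary χ2 distribution, with only the Python standard library. It is an illustration under stated assumptions: the function names are hypothetical, the incomplete gamma routines follow the usual textbook series and continued-fraction approach, and the sketch does not reproduce how Mplus derives difference tests under its robust weighted least squares estimator. The input values are the Δχ2 statistics reported in the results.

```python
import math


def _lower_gamma_series(a: float, x: float) -> float:
    """Regularized lower incomplete gamma P(a, x) by power series (for x < a + 1)."""
    term = 1.0 / a
    total = term
    n = a
    for _ in range(1000):
        n += 1.0
        term *= x / n
        total += term
        if abs(term) < abs(total) * 1e-15:
            break
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def _upper_gamma_cf(a: float, x: float) -> float:
    """Regularized upper incomplete gamma Q(a, x) by continued fraction (for x >= a + 1)."""
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 1000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi_square_p_value(statistic: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate with df degrees of freedom."""
    if statistic <= 0:
        return 1.0
    a, x = df / 2.0, statistic / 2.0
    if x < a + 1.0:
        return 1.0 - _lower_gamma_series(a, x)
    return _upper_gamma_cf(a, x)


# Delta chi-square values and degrees of freedom reported for models 2, 3, and 4.
for label, stat, df in [("model 2", 191.193, 24), ("model 3", 30.40, 14), ("model 4", 88.87, 13)]:
    print(label, "P < 0.01:", chi_square_p_value(stat, df) < 0.01)
```

A P value below the chosen level indicates that the added equality constraints significantly worsen fit, which the analysis treats as evidence of measurement bias.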


Evaluating Systematic Measurement Error

Given previous work,25–27 a single factor alcohol abuse model was initially tested (model 1) across whites, black/African-Americans, and Hispanics. Model 1 allowed poverty status and educational attainment each to have direct effects on each of the items (within statistical identification limits) and allowed poverty status and educational attainment to correlate. For statistical identification, model 1 fixed the factor mean and variance at 0 and 1, respectively, for whites, while freely estimating the black/African-American and Hispanic means and variances. Additional statistical identification constraints required constraining all groups' item intercepts to zero, fixing the direct effect of poverty status and educational attainment on the “ride in car as passenger while drinking” item to zero in all groups, constraining the loading for the “ride” item to equality across the groups, constraining the threshold for the “ride” item to equality across the groups, and fixing the uniquenesses to one for all groups. Model 1 included no other constraints.

In models like these, one must impose statistical constraints to achieve a unique solution.18 Typically, one chooses a “reference” group (here whites) and sets this group's mean and variance to zero and one, respectively. This sets the latent variable's metric to standard normal. The additional cross-group constraints place the results in the same latent metric, so one can interpret the parameters relative to each other. The chosen statistical identification constraints represent the minimal constraints necessary. However, one could identify the model across groups using a different “anchor” item. If bias exists in the anchor item, this can influence the results. Thus, to examine the possibility that using the “ride” item as an anchor for cross-group statistical identification might influence the results, the entire set of analyses was repeated using each of the other items as anchors. In none did the “ride” item exhibit bias. Likewise, with one minor exception, these iterations arrived at the exact same final model that is discussed below. This strongly supports the “ride” item as an anchor and minimizes concerns that analyses using different statistical identification would diverge from these. These results are available to interested readers on request.

Model 1 fit the data well (root mean square error of approximation = 0.021, comparative fit index = 0.98, Tucker-Lewis Index = 0.98, χ2 = 281.75, df = 39, n = 25,512, P < 0.01). Given good fit, model 2 was tested, which constrained the direct effects of poverty status and educational attainment to zero across all groups. These constraints led to statistically significant misfit (Δχ2 = 191.193, df = 24, n = 25,512, P < 0.01), indicating measurement bias as a function of poverty status and educational attainment. Space constraints preclude listing each set of differences here. However, Table 1 details the final results. Bolded values in the table highlight differences in the measurement parameters across the groups. For example, with respect to educational attainment, analyses revealed that at the same level of alcohol abuse, more highly educated whites had a greater likelihood than less educated whites of endorsing that alcohol caused trouble with family/friends. As a converse example, more highly educated Hispanics were less likely to indicate that they entered harmful situations while drinking. Related to the direct effects of poverty and educational attainment, analyses also indicated that the direct effects of poverty status on the “fights” and “legal” items, as well as the direct effects of educational attainment on “drive while drinking” and “drive after drinking,” did not differ across whites and black/African-Americans. Model 2b relaxed the constraints causing misfit.

Model 3 modified model 2b to constrain the loadings to equivalence across groups. This model examined whether the items related similarly to alcohol abuse across whites, black/African-Americans, and Hispanics, after accounting for systematic measurement error due to poverty status and educational attainment. Constraining the loadings resulted in statistically significant misfit (Δχ2 = 30.40, df = 14, n = 25,512, P < 0.01), indicating bias as a function of race/ethnicity. Analyses indicated that 3 equality constraints led to the misfit. As Table 1 shows, responses to the items provided less reliable measurement for the minority groups (ie, smaller loading values for minorities). Analyses also indicated that the loading values for black/African-Americans and Hispanics differed significantly from each other. Model 3b relaxed these constraints.

Model 4 modified model 3b to constrain the thresholds to equality across whites, black/African-Americans, and Hispanics. This model examined whether affirmative item endorsements had similar likelihoods across race and ethnicity. Constraining the thresholds resulted in statistically significant misfit (Δχ2 = 88.87, df = 13, n = 25,512, P < 0.01), indicating measurement bias. Analyses showed that 8 equality constraints led to misfit. Table 1 summarizes these differences. As one example, compared with whites at the same alcohol abuse level, black/African-Americans as well as Hispanics were more likely to endorse drinking while driving. Finally, analyses also indicated that black/African-American and Hispanic threshold values for the drinking while driving, driving after drinking, and entering harmful situations items did not differ from each other, though they did differ from those of whites. The final model relaxed these ill-fitting constraints.

In summary, analyses revealed statistically significant measurement bias across race and ethnicity, even after accounting for bias due to differential poverty status and educational attainment. The final model incorporated numerous direct effects of poverty status and educational attainment on alcohol abuse measurement and incorporated several differences in alcohol abuse measurement across whites, black/African-Americans, and Hispanics.

Mitigating Systematic Measurement Error

The presence of significant measurement bias across individuals of different income, educational, and race/ethnic backgrounds indicates that one should not use unadjusted scores to measure alcohol abuse. Rather, one should use model-based estimates of alcohol abuse levels to mitigate systematic error. To demonstrate the importance of using model-based estimates that mitigate systematic measurement error, the estimates from the final measurement model incorporating measurement differences were compared with estimates from a model ignoring systematic measurement error. Under the model ignoring measurement bias, whites served as the reference group and had a mean of zero (for statistical identification). Both black/African-Americans and Hispanics had significantly greater mean alcohol abuse levels (M for black/African-Americans = 0.43, z = 7.81; M for Hispanics = 0.173, z = 2.11). However, under the model mitigating systematic measurement error, black/African-Americans no longer differed significantly from whites (M for black/African-Americans = 0.136, z = 1.86) and Hispanics had significantly lower alcohol abuse levels (M for Hispanics = −0.261, z = −2.19).
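
The significance statements above can be checked with a short calculation. The sketch assumes a two-sided test against the standard normal distribution at the 0.05 level; the text reports only the z values, so the test form and level are assumptions, and the function name is hypothetical.

```python
import math


def two_sided_p(z: float) -> float:
    """Two-sided P value for a z statistic under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# z values reported for the latent mean differences from whites.
reported = {
    "ignoring bias, black/African-American": 7.81,
    "ignoring bias, Hispanic": 2.11,
    "mitigating bias, black/African-American": 1.86,
    "mitigating bias, Hispanic": -2.19,
}
for label, z in reported.items():
    verdict = "significant" if two_sided_p(z) < 0.05 else "not significant"
    print(f"{label}: z = {z}, {verdict} at the assumed 0.05 level")
```

Under these assumptions, only the bias-adjusted black/African-American contrast fails to reach significance, matching the pattern described in the text.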


This study shows the importance of empirically evaluating the presence of systematic measurement error in CER. Additionally, it was intended to demonstrate how systematic measurement error can influence analytic results and how model-based techniques can mitigate this error. The author also endeavored to briefly describe the mathematical and methodological tools used to conduct analyses probing for and correcting systematic measurement error. Using MG-MIMIC models, the current example showed that poverty status, educational attainment, and race and ethnicity all directly influenced alcohol abuse measurement. Without accounting for systematic measurement error due to these sources, one would conclude that Hispanics and black/African-Americans demonstrate significantly greater amounts of alcohol abuse behavior than whites. However, model-based estimates of alcohol abuse that corrected for measurement bias and mitigated systematic measurement error clarified that Hispanics demonstrate significantly lower amounts of alcohol abuse behavior than whites and that black/African-Americans do not differ significantly from whites in their alcohol abuse behavior. Without using model-based estimates, efforts to comparatively evaluate treatment effectiveness across these populations would result in flawed conclusions, especially in observational research.

These findings highlight that CER using self-reported measures in heterogeneous samples should consider whether group differences (or similarities) reflect true differences (or similarities) or result from systematic measurement error. Additionally, these results indicate that systematic reviews based on research not accounting for systematic measurement error may reflect spurious findings. Until CER that uses self-report measures more effectively examines cross-group measurement reliability and validity equivalence for the measured variables, the validity of conclusions will remain clouded.

Despite this study's strengths, its limits deserve review. First, the Hispanic subgroups could not be explored given small subgroup sample sizes. This may miss additional heterogeneity. Second, a representative sample was used; it remains unclear whether these results would hold in a clinical sample. Finally, a demonstration within an actual CER study was not provided. Rather, the study focused on the statistical method for mitigating bias. The dataset used in this study did not lend itself to both examining bias and CER questions. While these concerns leave issues unaddressed, they do not impede the study's ability to serve as an example of the statistical method.

Finally, to extend the example given in the introduction to the current results, suppose the dataset used here included detailed information about the different treatments that respondents had received. Then, rather than using the biased observed scores, one would use the model-based scores that adjust for measurement bias to select individuals and to compare differences in alcohol abuse levels across treatments and subgroups. In this way, one would achieve more accurate estimates of differences in alcohol abuse across the heterogeneous individuals included in the study.


In summary, systematic measurement error (ie, measurement bias) may profoundly influence CER results. Despite equivalence in underlying values of a measured health construct, people may respond dissimilarly to questions about their health as a function of various background variables. Without testing for the presence of measurement bias, it will remain unclear whether treatment evaluations demonstrating treatment effectiveness (or failures) across a heterogeneous population reflect true differences or systematic measurement error. Importantly, MG-MIMIC models offer a tool to simultaneously investigate and mitigate systematic measurement error. Model-based estimates of the health construct corrected for systematic measurement error will lead to more valid treatment effectiveness comparisons across heterogeneous groups. Regardless of the specific fields in which investigators work, researchers should consider MG-MIMIC models as a tool to incorporate into the CER tool box.


1. Muthén BO. Latent variable modeling in heterogeneous populations. Psychometrika. 1989;54:557–585.
2. Jones RN. Identification of measurement differences between English and Spanish language versions of the mini-mental state examination. Detecting differential item functioning using MIMIC modeling. Med Care. 2006;44:S124–S133.
3. Jones RN. Racial bias in the assessment of cognitive functioning of older adults. Aging Ment Health. 2003;7:83–102.
4. Bollen KA. Structural Equations With Latent Variables. New York, NY: John Wiley & Sons; 1989.
5. Muthén B. Contributions to factor analysis of dichotomous variables. Psychometrika. 1978;43:551–560.
6. Muthén B. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika. 1984;49:115–132.
7. Grant BF, Kaplan K, Shepard J, et al. Source and Accuracy Statement for Wave 1 of the 2001–2002 National Epidemiologic Survey on Alcohol and Related Conditions. Bethesda, MD: National Institute on Alcohol Abuse and Alcoholism; 2003.
8. Embretson S, Reise SP. Item Response Theory for Psychologists. Hillsdale, NJ: Lawrence Erlbaum Associates; 2000.
9. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. 4th ed. Washington, DC: American Psychiatric Association; 1994.
10. Grant BF, Dawson DA, Hasin DS. The Alcohol use Disorder and Associated Disabilities Interview Schedule-DSM-IV Version (AUDADIS-IV). Bethesda, MD: National Institute on Alcohol Abuse and Alcoholism; 2001.
11. Grant BF. Convergent validity of DSM-III-R and DSM-IV alcohol dependence: results from the National Longitudinal Alcohol Epidemiologic Survey. J Subst Abuse. 1997;9:89–102.
12. Grant BF. Theoretical and observed subtypes of DSM-IV alcohol abuse and dependence in a general population sample. Drug Alcohol Depend. 2000;60:287–293.
13. Harford TC, Muthén BO. The dimensionality of alcohol abuse and dependence: a multivariate analysis of DSM-IV symptom items in the national longitudinal survey of youth. J Stud Alcohol. 2001;62:150–157.
14. Grant BF, Harford TC, Dawson DA, et al. The Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS): reliability of alcohol and drug modules in a general population sample. Drug Alcohol Depend. 1995;39:37–44.
15. Hasin DS, Grant B, Cottler L. Nosological comparisons of alcohol and drug diagnoses: a multisite, multi-instrument international study. Drug Alcohol Depend. 1997;47:217–226.
16. Hasin D, Carpenter KM, McCloud S, et al. The Alcohol Use Disorder and Associated Disabilities Interview Schedule (AUDADIS): reliability of alcohol and drug modules in a clinical sample. Drug Alcohol Depend. 1997;44:133–141.
17. Hasin D, Paykin A. Alcohol dependence and abuse diagnoses: concurrent validity in a nationally representative sample. Alcohol Clin Exp Res. 1999;23:144–150.
18. Millsap RE, Yun-Tein J. Assessing factorial invariance in ordered-categorical measures. J Multivar Behav Res. 2004;39:479–515.
19. Hu L, Bentler PM. Fit indices in covariance structure modeling: sensitivity to underparameterized model misspecification. Psychol Methods. 1998;3:424–453.
20. Hu L, Bentler PM. Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct Equ Modeling. 1999;6:1–55.
21. Muthén LK, Muthén BO. Mplus User's Guide. 4th ed. Los Angeles, CA: Muthén & Muthén; 1998–2007.
22. Steiger JH. A note on multiple sample extensions of the RMSEA fit index. Struct Equ Modeling. 1998;5:411–419.
23. Korn E, Graubard BI. Estimating variance components by using survey data. J R Stat Soc Series B Stat Methodol. 2003;65:175–190.
24. Carle AC. Fitting multilevel models in complex survey data with design weights: recommendations. BMC Med Res Methodol. 2009;9:49.
25. Muthén BO, Hasin D, Wisnicki KS. Factor analysis of ICD-10 symptom items in the 1988 National Health Interview Survey on alcohol dependence. Addiction. 1993;88:1071–1077.
26. Muthén BO. Factor analysis of alcohol abuse and dependence symptom items in the 1988 National Health Interview Survey. Addiction. 1995;90:637–645.
27. Carle AC. Assessing the adequacy of self-reported alcohol abuse measurement across time and ethnicity: cross-cultural equivalence across Hispanics and Caucasians in 1992, non-equivalence in 2001–2002. BMC Public Health. 2009;9:60.

Key Words: comparative effectiveness research; measurement bias; differential item functioning; cross-cultural differences

© 2010 Lippincott Williams & Wilkins, Inc.