Researchers have become increasingly interested in information on survey respondents’ race/ethnicity. As with other demographic characteristics, this information allows researchers to examine how this social construct may affect the health and well-being of survey respondents when conducting studies focused on reducing racial and ethnic disparities in health.1–10 Although many surveys ask respondents to self-report their race and ethnicity, some respondents skip these items, either because they specifically do not wish to disclose their race/ethnicity or because of breakoff (ie, the respondent stops filling out the survey before reaching the end).
Researchers often exclude these observations from analyses, but the exclusion of these cases can reduce statistical power and might bias results. In particular, since members of some racial and ethnic groups may be less likely than others to self-report race/ethnicity,11 the representation of certain groups in the data set may be compromised. These problems are most acute when rates of missing data on race/ethnicity are high, as they have been for some survey efforts. For example, the 2010 Census and 2001–2009 American Community Survey were missing race data in 18%–22% of cases, and were missing data on Hispanic origin in 7%–9% of cases.12
An alternative to excluding observations with missing self-reported data on race/ethnicity is to impute race/ethnicity. We investigated the application of a validated method (the Medicare Bayesian Improved Surname and Geocoding method, or MBISG5) for indirectly estimating race/ethnicity where self-reported data were missing in a data set of Medicare beneficiaries’ health care experiences, the Consumer Assessment of Healthcare Providers and Systems (CAHPS) survey. The MBISG 1.0 method has been used to indirectly estimate race/ethnicity for Healthcare Effectiveness Data and Information Set (HEDIS) records on Medicare beneficiaries and to measure Medicare Advantage (MA) contract-level HEDIS performance for specific racial/ethnic groups.5 Currently, the MBISG 2.0 method, described by Haas et al,13 is being used by the Centers for Medicare & Medicaid Services (CMS) to produce national and contract-level stratified reports of quality measures.14 Here we apply the MBISG 2.0 method to investigate patterns of missing self-reported race/ethnicity in the Medicare CAHPS surveys.
We sought to answer the following questions:
- How often is self-reported race/ethnicity missing in the Medicare CAHPS data set?
- Are some racial/ethnic groups more likely than others to skip survey items asking about race/ethnicity?
- To what extent do cases of missing self-reported race/ethnicity appear to reflect a specific reluctance to disclose one’s race/ethnicity versus survey breakoff or other general factors?
- How much does imputing race/ethnicity in cases with missing self-reported race/ethnicity affect the estimated overall distribution of respondents by race/ethnicity?
- How much does imputing race/ethnicity affect statistical power and sample size?
To address these questions, we pooled 2013–2014 MA and fee-for-service CAHPS data. CMS conducts the Medicare CAHPS surveys annually to collect, analyze, and report data on beneficiaries’ experiences with their health care services.15 The Medicare CAHPS surveys are mail surveys with telephone follow-up of mail nonrespondents, based on a stratified random sample of Medicare beneficiaries, with states as strata for fee-for-service beneficiaries and contracts as strata for MA (managed care) beneficiaries. Response rates were 45% in 2013 and 41% in 2014.
Person-level survey weights were used for all models and estimates to account for sample design and nonresponse. Survey weights were developed by raking the sample to match the marginal distribution of the enrolled population in each county by contract combination on sex, age, race/ethnicity, Medicaid eligibility/low-income subsidy enrollment status, Special Needs Plan enrollment, and zip-code–level distributions of income, education, and race/ethnicity.16,17
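The raking procedure used to build these weights can be illustrated compactly. The sketch below is a toy version of iterative proportional fitting on two categorical margins; the variable names, categories, and targets are invented for illustration and do not reproduce the actual CMS weighting.

```python
# Toy sketch of raking (iterative proportional fitting): rescale base
# weights so weighted counts match known population margins. Purely
# illustrative; two margins and made-up targets, not the CMS procedure.

def rake(w, a, b, target_a, target_b, iters=100):
    """Alternately rescale weights so the weighted count in each category
    of margin `a`, then margin `b`, matches its population target."""
    w = list(w)
    for _ in range(iters):
        for dim, targets in ((a, target_a), (b, target_b)):
            for cat, t in targets.items():
                idx = [i for i, c in enumerate(dim) if c == cat]
                s = sum(w[i] for i in idx)
                for i in idx:
                    w[i] *= t / s
    return w

# 4 respondents cross-classified by sex and age group, base weight 1 each.
sex = ["M", "M", "F", "F"]
age = ["65-74", "75+", "65-74", "75+"]
w = rake([1, 1, 1, 1], sex, age, {"M": 3, "F": 1}, {"65-74": 2, "75+": 2})
```

After raking, the weighted totals for each sex and age category match their targets, while the total weight is preserved.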
The surveys ask beneficiaries about their experiences of health care across several dimensions.18 For instance, there are questions that ask the respondent how easy it was to receive needed care, how quickly he or she received care, how often his or her health plan’s customer service provided useful information, how often his or her doctor explained things in a way that was easy to understand, and several other questions pertaining to experience of care.
Following the federal standards19 on the collection of race and ethnicity data, the Medicare CAHPS survey uses a 2-question format with ethnicity collected first. The first question asks beneficiaries whether they are of Hispanic or Latino origin or descent. The second question asks their race, with response options of “white,” “Black or African American,” “Asian,” “Native Hawaiian or other Pacific Islander,” and “American Indian or Alaska Native” and the opportunity to mark ≥1 responses. These questions appear in the final section of the survey, “About You,” which includes questions about other demographic and health-related characteristics.
If a beneficiary self-identified as Hispanic, then they were classified as Hispanic regardless of any races endorsed. Non-Hispanic beneficiaries who marked no races other than “Asian” or “Native Hawaiian or other Pacific Islander” were classified as Asian/Pacific Islander (API). The remaining non-Hispanic beneficiaries who endorsed exactly one race were classified as American Indian or Alaska Native (AI/AN), Black, or white. If a beneficiary endorsed ≥2 races (other than the combination of “Asian” and “Native Hawaiian or other Pacific Islander” alone), we classified the beneficiary as multiracial. Thus, we had 6 mutually exclusive racial/ethnic categories: non-Hispanic white (hereafter “white”), Black, Hispanic, API, AI/AN, and multiracial.
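The classification rules above can be expressed as a short function. This is a minimal sketch with hypothetical inputs (a Hispanic-origin flag from the ethnicity item and the set of race boxes marked), not the production coding logic:

```python
# Minimal sketch of the 6-category classification rules described above.
# Inputs are hypothetical: `hispanic` comes from the ethnicity item and
# `races` is the set of race response options the respondent marked.

API_RACES = {"Asian", "Native Hawaiian or other Pacific Islander"}
SINGLE_RACE = {
    "American Indian or Alaska Native": "AI/AN",
    "Black or African American": "Black",
    "white": "white",
}

def classify(hispanic, races):
    """Map self-reported ethnicity and race(s) to one mutually exclusive
    racial/ethnic category."""
    if hispanic:
        return "Hispanic"                # Hispanic regardless of races endorsed
    if races and races <= API_RACES:
        return "API"                     # only API races marked
    if len(races) == 1:
        return SINGLE_RACE[next(iter(races))]
    if len(races) >= 2:
        return "Multiracial"             # 2+ races beyond the API pair
    return "Missing"                     # no race reported
```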
To indirectly estimate the race/ethnicity of respondents who did not self-report their race/ethnicity, we applied the MBISG 2.0 method, which has replaced the previous MBISG 1.0.5,20,21 For Medicare beneficiaries, CMS has administrative race/ethnicity data based on Social Security Administration (SSA) records. Although these SSA-based data are suitable for identifying Black beneficiaries, they do not perform well in identifying Hispanic or API beneficiaries.22
To address this limitation, MBISG 2.0 combines CMS administrative variables and Census data to assign each person a set of probabilities of falling into each of the 6 racial/ethnic groups used in our analysis. MBISG 2.013 estimates these probabilities from a multinomial logistic regression of self-reported race/ethnicity on all predictors in a large, nationally representative data set weighted to represent the population of Medicare beneficiaries. The MBISG 2.0 estimates are the predicted probabilities of falling into each of the 6 racial/ethnic groups from this model, with each beneficiary’s set of 6 probabilities summing to one. MBISG 2.0 racial/ethnic estimates can be used as predictors in regression models in the same way that a set of indicators would be, with one group (usually the largest group, white) omitted as the reference category. This approach generally results in more accurate estimates of differences between groups than converting the probabilities into a categorical variable, for instance, by assigning each person to the group with the highest probability in that person’s set.23 Because MBISG 2.0 is not recommended for inference with regard to the AI/AN and multiracial groups,13 we focus on results from the 4 largest racial/ethnic groups: white, Black, Hispanic, and API.
We report the “area under the curve” (AUC), also known as the C-index or concordance statistic,24 to summarize the MBISG 2.0’s predictive accuracy for self-reported race/ethnicity. An AUC value of 0.5 indicates chance, or that the MBISG 2.0 probability is unrelated to true race/ethnicity, while a value of 1 indicates perfect performance.
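As a concrete illustration, the AUC can be computed directly from its rank (Mann-Whitney) interpretation. The toy data below are invented; in the paper the probability would be an MBISG 2.0 estimate for one group and the label an indicator of self-reported membership in that group.

```python
# Toy illustration of the AUC (C-statistic) used above: how well a
# continuous probability discriminates membership in a group.
# 0.5 indicates chance; 1.0 indicates perfect discrimination.

def auc(probs, labels):
    """Mann-Whitney form of the AUC: the share of (member, nonmember)
    pairs in which the member received the higher probability, with
    ties counted as half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # perfect separation -> 1.0
print(auc([0.5, 0.5], [1, 0]))                   # no discrimination -> 0.5
```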
We imputed self-reported race/ethnicity using the MBISG 2.0 probabilities as described in Elliott et al.22 When race/ethnicity was not self-reported, a respondent received their imputed set of 6 non-zero MBISG 2.0 probabilities, which sum to 1. When race/ethnicity was self-reported, the respondent received a value of 1 for the category in which they self-reported and values of 0 in all other categories. These 6 imputed racial/ethnic variables were used in univariate analyses to estimate the distribution of race/ethnicity and as predictors in regression models to estimate adjusted differences between groups; both univariate analyses and regression models were weighted with the survey analytic weight. To estimate the population proportion falling into a category (such as not self-reporting race/ethnicity or breaking off from the survey) for each racial/ethnic group, we weighted all survey respondents to resemble the Medicare population of each racial/ethnic group in turn. To do this, we created a series of product weights by multiplying the survey analytic weight for the respondent by the imputed race/ethnicity variable for each racial/ethnic group for the same respondent. These 6 product weights account for both the probability of inclusion of the respondent from the Medicare population (from the survey analytic weights) and the probability of being a member of a specified group (from the imputed racial/ethnic variables). Thus, someone reporting being in a specific racial/ethnic group got the full survey weight for that group and a weight of 0 for the other racial/ethnic groups. The survey weight of someone not reporting race/ethnicity was divided among the racial/ethnic groups proportionally to the MBISG 2.0 probabilities. These group-specific product weights were then used in separate analyses to estimate proportions for each group.
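The product-weight construction described above amounts to one multiplication per group. A minimal sketch, with hypothetical variable names and invented example values:

```python
# Sketch of the group-specific product weights described above. For each
# respondent, the product weight for group k is the survey analytic weight
# times the imputed race/ethnicity value for group k (an indicator for
# self-reporters, an MBISG 2.0 probability otherwise). Names are hypothetical.

GROUPS = ["white", "Black", "Hispanic", "API", "AI/AN", "Multiracial"]

def product_weights(analytic_weight, imputed):
    """Split one respondent's survey weight across the 6 groups."""
    assert abs(sum(imputed) - 1.0) < 1e-9  # indicators or probabilities sum to 1
    return {g: analytic_weight * p for g, p in zip(GROUPS, imputed)}

# A self-reported Black respondent keeps the full weight in one group...
print(product_weights(2.0, [0, 1, 0, 0, 0, 0]))
# ...while a nonreporter's weight is divided by the MBISG 2.0 probabilities.
print(product_weights(2.0, [0.60, 0.20, 0.10, 0.05, 0.03, 0.02]))
```

Note that each respondent's 6 product weights always sum back to the original analytic weight, so no survey weight is lost or created by the construction.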
Among the 21,088 of 508,597 survey respondents in the data set who did not report their race/ethnicity, we estimated the number in each racial/ethnic group as the sum of the corresponding racial/ethnic probabilities. We used the product weights to estimate the percentage of each racial/ethnic group who did not self-report their race/ethnicity, as described above. In addition, we explored 2 factors potentially associated with not answering the race/ethnicity items for reasons other than reluctance to disclose race/ethnicity: survey breakoff and general patterns of item missingness not due to breakoff. We defined survey breakoff as failure to complete any items in the final section of the survey, “About You,” which includes race/ethnicity and other demographic items. We report the rates of survey breakoff for each racial/ethnic group. We also report rates of race/ethnicity item missingness among nonbreakoff cases; however, not answering these items may reflect not a specific desire to avoid disclosing this information, but rather a generally higher item nonresponse rate that would affect other items in the demographic section. Therefore, we also report rates of missingness among nonbreakoff cases for 2 other demographic items, age and sex. These 2 variables may be less sensitive to disclose than race/ethnicity, and because they are also administratively available to CMS, there would be less motivation to withhold this information. Age and sex were included on the 2013 but not the 2014 survey, limiting the missingness analysis for age and sex to 2013 data.
We tested whether item nonresponse and breakoff rates differed by race/ethnicity in a series of linear mixed-effects regression models predicting the probability of item missingness or breakoff from imputed race/ethnicity. The dependent variable was binary (1 if the respondent did not answer the item or broke off from the survey and 0 otherwise). The independent variables were the imputed racial/ethnic variables described above, with white omitted as the reference group, and an indicator for survey year (except when predicting 2013-only missingness of age and sex). The model also incorporated a random effect for MA contracts, since contracts and associated survey vendors may have different rates of item missingness and survey breakoff. Only nonbreakoff cases were included in the model for analyses comparing item missingness rates among nonbreakoff cases.
We calculated the overall racial/ethnic distribution before and after imputation of race/ethnicity for item nonrespondents. To quantify how much imputation affects statistical power, we calculated the effective sample size (ESS) for each racial/ethnic group before and after imputation.25 Before imputation, survey respondents who did not report their race/ethnicity were not included in the calculation for any group. After imputation, the racial/ethnic-specific weights, formed as the product of the imputed racial/ethnic variables and the analytic survey weights, were used to calculate ESSs for each racial/ethnic group among those reporting being in that group and those not reporting race/ethnicity.
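The ESS calculation uses Kish's approximation, which depends only on the weights. A minimal sketch with invented weights; in the paper the inputs would be the group-specific product weights:

```python
# Kish's effective sample size, used above to quantify the power gained by
# imputation: ESS = (sum of weights)^2 / (sum of squared weights).
# Weights here are invented for illustration.

def effective_sample_size(weights):
    """Equals n when all weights are equal; shrinks as weights vary."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

print(effective_sample_size([1, 1, 1, 1]))       # equal weights -> 4.0
# Adding a nonreporter with a partial (product) weight of 0.5 raises the
# group's ESS, though by less than a full self-reporting case would.
print(effective_sample_size([1, 1, 1, 1, 0.5]))
```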
In the 2013–2014 Medicare CAHPS data, there were a total of 508,497 beneficiaries for whom we could impute race/ethnicity. Among cases self-reporting race/ethnicity, the AUC for MBISG 2.0 was 0.96 for whites, 0.99 for Blacks, 0.95 for Hispanics, and 0.98 for API beneficiaries.
Self-reported race/ethnicity was missing in 3.7% of all cases. The rate of missingness for race/ethnicity is similar to other nearby items within the “About You” section of the survey, for instance, highest education level (3.5%), number of people living in the household (3.1%), age (3.5%), and sex (3.3%). We investigated whether the cases in which respondents do not self-report race/ethnicity are randomly distributed with respect to estimated race/ethnicity. Our analysis suggests this is not the case (Table 1). We estimate that Black beneficiaries are most likely to not self-report race/ethnicity, followed by Hispanic and API beneficiaries (P<0.001 for a test of each group against whites). We estimate that 6.6% of Blacks, 4.7% of Hispanics, and 4.7% of API beneficiaries do not self-report race/ethnicity. In contrast, we estimate that only 3.2% of whites do not self-report race/ethnicity.
Column 4 in Table 1 shows that 1.2% of respondents break off from the survey before the “About You” section. Breakoff rates vary by race/ethnicity, with Hispanic, Black, and API beneficiaries having somewhat higher rates of breakoff (2.2%, 2.1%, and 1.5%, respectively) than whites (1.0%) (P<0.01). Among survey respondents who do not self-report race/ethnicity, only 33.7% are breakoff cases; that is, only about 1/3 of missingness in self-reported race/ethnicity is due to survey breakoff.
Columns 5–7 of Table 1 report item missingness rates among nonbreakoff cases for race/ethnicity, age, and sex, respectively. Overall, 2.5% of nonbreakoff cases do not report race/ethnicity, with higher item skip rates among Black (4.5%), API (3.2%), and Hispanic (2.6%) beneficiaries than among white beneficiaries (2.2%) (P<0.001 for each comparison against white). Thus, most of the overall difference between racial/ethnic groups in race/ethnicity item skip rates is due to skips among nonbreakoff cases rather than to survey breakoff.
Item missingness rates of age and sex among nonbreakoff cases show a similar pattern by race/ethnicity: Black, Hispanic, and API beneficiaries are more likely than white beneficiaries to not answer the age and sex items even when they complete some demographic items (P<0.001 for each comparison against white). Among nonbreakoff cases, Hispanics had a 0.4 percentage point higher item missingness rate for race/ethnicity than whites; this difference is slightly smaller than the difference for the age and sex items, for which Hispanics had a 0.6 and 0.9 percentage point higher item missingness rate than whites, respectively. API beneficiaries had a 1.0 percentage point higher item missingness rate for race/ethnicity than whites; this is similar to differences in item missingness rates between API and whites for age and sex (0.6 and 0.8 percentage point, respectively). In contrast, the item missingness rate for race/ethnicity was 2.3 percentage points higher for Black beneficiaries than for whites, higher than the difference in item missingness for age or sex, for which there was a 0.9 and 0.7 percentage point gap, respectively.
Taken together, the results in Table 1 indicate that rather than representing beneficiaries’ reluctance to disclose their race/ethnicity, missingness of the race/ethnicity items primarily appears to reflect breakoff and other general factors with regard to item nonresponse, especially for Hispanic and API beneficiaries. We cannot rule out the possibility that some of the relatively high race/ethnicity item skip rate among Black nonbreakoff respondents is due to reluctance to disclose race/ethnicity, but if it is, it is likely that racial/ethnic-specific factors represent only a minority of the overall difference in skip rates compared with whites. In general, item missingness is higher for those respondents with lower education and older age as well as for racial/ethnic minorities.11
Table 2 shows the extent to which adding cases with imputed race/ethnicity affected our estimated overall distribution of respondents and ESS by race/ethnicity. Column 4 provides the overall racial/ethnic distribution based only on those who self-reported their race/ethnicity, and column 5 shows this distribution after also including, via their estimated race/ethnicity, respondents who did not self-report. Imputation had a minimal effect on our overall estimated distribution of respondents by race/ethnicity. The largest increase was observed for Blacks; the estimated percentage of responses from Black beneficiaries increased from 9.1% to 9.4%. The estimated percentage of responses from Hispanic and API beneficiaries increased slightly from 7.8% to 7.9% and from 2.8% to 2.9%, respectively. The estimated proportion of responses from white beneficiaries declined slightly from 77.7% to 77.3%.
Including respondents who did not self-report race/ethnicity in statistical models has the potential to increase sample sizes and thus improve statistical power. Columns 1–3 show the total number of respondents self-reporting race/ethnicity, the ESS based only on self-report, and the ESS when race/ethnicity is imputed for cases that did not self-report these items. The final column shows the relative increase in ESS through imputation. Indirectly estimating race/ethnicity for respondents who did not self-report it yields relative ESS increases of 8.6% for Blacks, 6.7% for Hispanics, and 6.5% for API beneficiaries (Table 2, final column). CMS reports scores for 7 CAHPS measures stratified by race/ethnicity within MA contracts,14 requiring a minimum of 100 item completes for a given racial/ethnic group to be reportable. Adding imputed race/ethnicity would make 42 additional estimates reportable for Black beneficiaries, 22 for Hispanic, and 5 for API, an increase of 4%–8% in the number of contracts reportable (data not shown).
Our study suggests that it may be worthwhile to impute race/ethnicity for cases in which data on race/ethnicity are unavailable in survey data sets due to item nonresponse. The concern that members of some racial/ethnic groups are more likely than those in other groups to skip questions asking about race and ethnicity is borne out by our results. We found that Black, Hispanic, and API Medicare beneficiaries were more likely than white beneficiaries to skip these items. Thus, using a validated method to impute race/ethnicity (such as the MBISG 2.0 method used in this study) can result in better representation of racial/ethnic minorities in survey data sets. In a recent application of the Bayesian Improved First Name Surname Geocoding method to mortgage applicants, Voicu26 found that Blacks were less likely than whites to self-report race/ethnicity on loan applications.
Our results do not suggest that respondents who skipped these questions did so primarily out of reluctance to disclose their race/ethnicity. The low rate of missingness (3.7% of all cases) in our data set and its comparability to rates for other adjacent demographic items suggest that the reasons for not reporting race/ethnicity may not be specific to the content of the items. Rather, different rates of missing data on race/ethnicity mainly appear to have been the result of different general patterns of item missingness, including both breakoff and item missingness among nonbreakoff cases, consistent with general patterns on item missingness.11 Survey breakoff primarily occurs during the telephone follow-up mode of the Medicare CAHPS surveys, a mode which contains especially high proportions of completes from Black and Hispanic respondents.11
However, there is some evidence that a small amount of the higher item missingness of the race/ethnicity items among Blacks may be specific to race/ethnicity and thus might reflect reluctance to disclose race/ethnicity. We also cannot rule out the possibility that some respondents skip the entire demographic section because of a reluctance to disclose their race/ethnicity. However, respondents may also skip this section of the survey due to a reluctance to disclose other demographic characteristics.
We did not find that adding imputed cases of race/ethnicity to the data set substantially altered the estimated overall distribution of respondents by race/ethnicity. It did result in a small increase in ESS (both overall and for individual racial/ethnic groups) and statistical power. For example, the number of contracts for which estimates specific to Black, Hispanic, and API members would be possible would increase by 4%–8%. Our CAHPS data set contained missing data on race/ethnicity in only 3.7% of cases. To the extent that other survey data sets have higher rates of missingness, the impact of imputation on the overall distribution of respondents by race/ethnicity may be greater, and the gains in sample size and statistical power may also be greater.
Because no 2 data sets are alike, a replication of our analysis using a different data set may yield different results. Our data were restricted to Medicare beneficiaries and focused on their health care experiences. This population consists of seniors and younger beneficiaries with disabilities. Other populations may be more or less likely to skip questions in general, or may be more reluctant to disclose their race/ethnicity in particular. More research is needed to determine the degree to which racial/ethnic minorities are more likely than whites to skip items asking about race and ethnicity in other surveys.
The authors would like to thank Biayna Darabidian, BA for assistance with manuscript preparation.
1. Coker TR, Elliott MN, Toomey SL, et al. Racial and ethnic disparities in ADHD diagnosis and treatment. Pediatrics. 2016;138:1–9.
2. Institute of Medicine. Unequal Treatment: Understanding Racial and Ethnic Disparities in Health Care. Washington, DC: National Academies Press; 2002.
3. Jung DH, Palta M, Smith M, et al. Differences in receipt of three preventive health care services by race/ethnicity in Medicare Advantage plans: tracking the impact of pay for performance, 2010 and 2013. Prev Chronic Dis. 2016;13:E125.
4. Marrast L, Himmelstein DU, Woolhandler S. Racial and ethnic disparities in mental health care for children and young adults: a national study. Int J Health Serv. 2016;46:810–824.
5. Martino SC, Weinick RM, Kanouse DE, et al. Reporting CAHPS and HEDIS data by race/ethnicity for Medicare beneficiaries. Health Serv Res. 2013;48:417–434.
6. Martino SC, Elliott MN, Hambarsoomian K, et al. Racial/ethnic disparities in Medicare beneficiaries’ care coordination experiences. Med Care. 2016;54:765–771.
7. Mejia de Grubb MC, Salemi JL, Gonzalez SJ, et al. Trends and correlates of disparities in alcohol-related mortality between Hispanics and non-Hispanic whites in the United States, 1999 to 2014. Alcohol Clin Exp Res. 2016;40:2169–2179.
8. National Center for Education Statistics. The Nation’s Report Card: Science 2011, National Assessment of Educational Progress at Grade 8. US Department of Education; 2012.
9. Quinn DM, Cooc N. Science achievement gaps by gender and race/ethnicity in elementary and middle school. Educ Res. 2015;44:336–346.
10. Williams JS, Walker RJ, Egede LE. Achieving equity in an evolving healthcare system: opportunities and challenges. Am J Med Sci. 2016;351:33–43.
11. Klein DJ, Elliott MN, Haviland AM, et al. Understanding nonresponse to the 2007 Medicare CAHPS survey. Gerontologist. 2011;51:843–855.
12. Fernandez LE, Rastogi S, Ennis SR, et al. Evaluating race and Hispanic origin responses of Medicaid participants using census data. CARRA Working Paper Series. 2015;1:1–29.
13. Haas A, Elliott MN, Dembosky JW, et al. Imputation of race/ethnicity to enable measurement of HEDIS performance by race/ethnicity. Health Serv Res. In press.
16. Purcell NJ, Kish L. Postcensal estimates for local areas (Or Domains). Int Stat Rev. 1980;48:3–18.
17. Deville J-C, Sarndal C-E, Sautory O. Generalized raking procedures in survey sampling. J Am Stat Assoc. 1993;88:1013–1020.
18. Centers for Medicare and Medicaid Services. Medicare advantage and prescription drug plan CAHPS survey. 2018. Available at: https://www.ma-pdpcahps.org/. Accessed March 2, 2018.
19. Office of Management and Budget. Revisions to the standards for the classification of federal data on race and ethnicity. Washington, DC: Office of Management and Budget; 1997.
20. Morales LS, Elliott MN, Weech-Maldonado R, et al. Differences in CAHPS adult survey reports and ratings by race and ethnicity: an analysis of the National CAHPS benchmarking data 1.0. Health Serv Res. 2001;36:595–617.
21. Elliott MN, Fremont A, Morrison PA, et al. A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity. Health Serv Res. 2008;43:1722–1736.
22. Elliott MN, Morrison PA, Fremont A, et al. Using the census bureau’s surname list to improve estimates of race/ethnicity and associated disparities. Health Serv Outcomes Res Methodol. 2009;9:69–83.
23. McCaffrey DF, Elliott MN. Power of tests for a dichotomous independent variable measured with error. Health Serv Res. 2008;43:1085–1101.
24. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.
25. Gabler S, Häder S, Lahiri P. A model-based justification of Kish’s formula for design effects for weighting and clustering. Surv Methodol. 1999;25:105–106.
26. Voicu I. Using first name information to improve race and ethnicity classification. Stat Public Policy. 2018;5:1–13.