Although technological advances have improved the wellbeing of our society, socioeconomically disadvantaged people, ethnic minorities, and other underserved populations continue to experience a disproportionate share of adverse health outcomes.1–3 To rectify this health disparity,4–6 it is critical for the research and medical communities to leverage Big Data technology for understanding the causes of health disparities and guiding interventions.7 For example, recent studies on addressing the limited availability of social determinants (eg, race, ethnicity)8 in administrative/survey data7,9 include the statistical inference of race/ethnicity from surname and location,10 demonstrating the power of data analytics in health disparity research.
What has received less attention is the complex interplay between health disparity research and the privacy implications of collecting health care data containing sensitive diagnosis and demographics information. Although precautions like removing personally identifiable attributes are often taken, privacy leakage happens even after such removals.11,12 This raises the question of how to study health disparities while protecting the privacy of individuals involved.
Privacy concerns over health care data have long been recognized: a meta-analysis13 identified privacy as a key concern raised by Big Data technologies in health care. Numerous solutions have been developed; informed consent is popular in practice, yet less effective when the data are subject to secondary use,14 as in health disparity research. Technically, there has been extensive work on privacy-preserving data publishing.15,16 Although directly using general-purpose privacy-protection tools appears straightforward, 2 special characteristics of health disparity research call for special treatment:
- Privacy disparity: the specific vulnerability of underprivileged populations to privacy intrusions. A critical consequence of privacy disclosure is collateral damage: an individual held liable not for his/her own behavior but for the behavior of others who are “similar” on social determinants.17 This harm is more pronounced for underprivileged groups: 52% of Americans in the lowest earning group (household income under $20,000) are “very concerned” about the collection of personal information, compared with just 37% in the highest group (over $100,000).18
- Privacy protection masking inequality: privacy protection hides the presence of individuals, thereby increasing the risk of masking inequalities. For example, any mechanism achieving ε-differential privacy,16 a state-of-the-art standard, automatically attains kε-differential privacy for any group of k individuals. That is, if we seek strong privacy protection (a small ε=0.001) when there is a small underprivileged subpopulation (k=100), then even the presence of this subpopulation becomes undetectable from privacy-preserved data, masking potential disparities.
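The masking effect follows directly from the group-privacy property of differential privacy; a one-line sketch of the composition argument:

```latex
% If mechanism M satisfies \varepsilon-differential privacy, then for any
% two datasets D, D' differing in k records and any output set S:
\Pr[M(D) \in S] \;\le\; e^{k\varepsilon}\,\Pr[M(D') \in S].
% With \varepsilon = 0.001 and k = 100, e^{k\varepsilon} = e^{0.1} \approx 1.11:
% the output distribution changes by at most ~11% whether or not the
% entire 100-person subpopulation is present in the data.
```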
SCOPE OF PAPER
This paper demonstrates the existence of privacy disparity in health care data, and examines the effectiveness of state-of-the-practice/art tools on protecting privacy and enabling health disparity research. It does not aim to solve all challenges facing privacy in health disparity research: the design of privacy-preserving tools can be extremely complex, demanding specific designs for specific use cases.19 Instead, our goal is to illustrate the existence of key challenges, and call for future technical research and policy discussions.
We used the Pennsylvania Inpatient (PA) Dataset (486,924 records, 82 attributes) for Q1 2001.20 Personally identifiable information has been removed, but demographics like sex, age, race, ZIP code remain. Other fields include admission, treatment, and discharge information.
We chose the PA dataset because of its representativeness: the demographics attributes included in the dataset are widely prevalent in other datasets. Among 20 state-level inpatient databases maintained by Agency for Healthcare Research and Quality (AHRQ), all 20 include sex, 19 include age, 18 include race, and 17 include ZIP code.21 All 4 are included in AHRQ nationwide databases such as National Inpatient Sample (NIS)22 and Kids’ Inpatient Database (KID).23
It is also important to note that the results in this article do not depend on the specific composition of the PA dataset: none of the 4 demographics attributes (or even their combination) can guarantee the unique identification of an individual. Instead, each of these (and many other) attributes discloses partial identity information that causes privacy concerns. Thus, the results of this paper readily extend to datasets with different coverage of demographics and/or health-related attributes.
State-of-the-Practice Anonymization (Texas Procedure)
When releasing the Texas Inpatient Data, a dataset structurally similar to PA dataset, Texas Health Care Information Collection (THCIC) applied anonymization rules,24 also applicable to PA dataset:
- Suppress the last 2 digits of any ZIP code with <30 patients.
- Suppress ZIP code, facility-type indicators, and provider ID if the hospital has <50 discharges.
- Suppress ZIP code and provider ID if the hospital has <5 discharges of a sex.
- Change race to “Other” and suppress ethnicity if the hospital has <10 discharges of a race.
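To make the rules concrete, the following is a minimal sketch of the race-suppression rule; the record layout and field names are simplified assumptions for illustration, not the exact THCIC implementation:

```python
from collections import Counter

def apply_race_suppression(records, threshold=10):
    """Texas-style rule: within each hospital, change race to 'Other' and
    drop ethnicity when fewer than `threshold` discharges share that race.
    Records are hypothetical dicts, a simplification of the PA layout."""
    counts = Counter((r["hospital"], r["race"]) for r in records)
    out = []
    for r in records:
        r = dict(r)  # copy; do not mutate the input
        if counts[(r["hospital"], r["race"])] < threshold:
            r["race"] = "Other"
            r.pop("ethnicity", None)
        out.append(r)
    return out

# Toy data: 12 white discharges and 2 Asian discharges at one hospital.
records = (
    [{"hospital": "H1", "race": "White", "ethnicity": "NH"}] * 12
    + [{"hospital": "H1", "race": "Asian", "ethnicity": "NH"}] * 2
)
suppressed = apply_race_suppression(records)
print(sum(r["race"] == "Other" for r in suppressed))  # -> 2
```

Note that, by construction, only the minority group's records are altered, which is the disparity mechanism examined later in this paper.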
Although these rules were specifically proposed by THCIC, their design principles are shared across anonymization attempts in public health practice: the Department of Health and Human Services recommended suppressing the last 2 digits of ZIP code as one of the measures for complying with HIPAA,25 and New York State has applied the same ZIP code suppression to HIV records since 2008.26
State-of-the-Art Differential Privacy Techniques
Depending on how privacy-preserved data are used, many different designs exist for differential privacy.16,27,28 We considered as an example χ2 independence test, and adopted MCIndepD,19 a state-of-the-art differential privacy algorithm for this test.
Description of Methods
To study privacy disparity, we examined whether releasing PA dataset levies different privacy risks against different races. An important issue here is how to measure privacy risk. Although numerous privacy measures have been proposed, no universal measure exists because the potential linkage between an administrative dataset and many extraneous (eg, social media) datasets makes it difficult to quantify privacy risk without understanding all linkage channels. Although modern techniques like differential privacy can be robust against extraneous linkage, they do not measure privacy disclosure of raw data. For the purpose of this article, we adopted 2 simple measures to study privacy disparity:
- Percent uniqueness (PU): percentage of records with a unique value combination on the attributes of interest.15 A higher PU naturally indicates higher privacy risk.
- Information entropy (IE): IE29 of the distribution of attributes of interest. Intuitively, IE measures the number of bits required to “describe” the attributes. The lower IE is, the less uncertainty there is for attribute values, the higher privacy risk becomes.
We measured PU and IE for 2 types of attributes: (1) medical-related ones like diagnosis codes and (2) demographic ones like age and sex, and compared each across racial groups.
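As a concrete sketch, the 2 measures can be computed as follows; the attribute values below are fabricated for illustration only:

```python
import math
from collections import Counter

def percent_uniqueness(values):
    """PU: percentage of records whose value (combination) occurs exactly once."""
    counts = Counter(values)
    return 100.0 * sum(1 for v in values if counts[v] == 1) / len(values)

def information_entropy(values):
    """IE: Shannon entropy (in bits) of the empirical value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical diagnosis codes for two groups: the smaller group contains
# rare codes that occur only once, yielding higher PU (more privacy risk).
majority = ["250.0"] * 50 + ["401.9"] * 50
minority = ["250.0"] * 4 + ["V30.0", "038.9"]

print(percent_uniqueness(majority))   # -> 0.0
print(percent_uniqueness(minority))   # -> 33.33...
print(information_entropy(majority))  # -> 1.0
```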
Our metrics are more generic than other existing ones, which are specific either to the technique used for privacy preservation,30 or to a predetermined baseline distribution for sensitive data.31 As a result, our metrics can be applied to both original and anonymized data, fitting the need of our research.
Examination of Anonymization
We then examined the effect of Texas procedure on PA dataset, specifically how it affects utility for health disparity research. To capture utility reduction, we considered 2 measures:
- Number of suppressions: count of individuals with attribute values suppressed.
- Error incurred to analytical query answers: we considered as a sample query the frequency distribution of principal diagnosis code (P_DIAG), and used the Kullback-Leibler (KL)-divergence32 to measure the distance between the raw distribution from the PA data and that from the anonymized data. Intuitively, KL-divergence KL(P||Q) captures the number of extra bits one needs to describe a distribution P after learning Q. The smaller it is, the less impact anonymization has on data utility.
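As a sketch, the KL-divergence between 2 empirical P_DIAG distributions can be computed as follows; the diagnosis counts and the zero-probability floor are illustrative assumptions, not the exact handling in our experiments:

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts):
    """KL(P||Q) in bits between two empirical distributions given as
    value->count maps. Values absent from Q get a small probability floor
    to keep the divergence finite (a common smoothing choice)."""
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    eps = 1e-9
    kl = 0.0
    for v, c in p_counts.items():
        p = c / n_p
        q = q_counts.get(v, 0) / n_q or eps
        kl += p * math.log2(p / q)
    return kl

raw = Counter({"401.9": 80, "250.0": 15, "V30.0": 5})
anonymized = Counter({"401.9": 80, "250.0": 20})  # rare code folded away

print(kl_divergence(raw, raw))         # identical distributions -> 0.0 bits
print(kl_divergence(raw, anonymized))  # suppression of the rare code costs >1 bit
```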
Examination of Differential Privacy
Our goal was to test whether, after achieving differential privacy, the dataset still supports statistics used in health disparity research. We applied MCIndepD19 over the PA dataset, and studied how it affects the outcome of an independence test between “Race” and “MediQual Atlas Midstay Nonresponder Indicator” (whether a patient responds to treatment: Yes-No-Ineligible). To study how differential privacy could mask inequalities, we considered subsets of the PA dataset, specifically all records with: (1) race=Asian or white; (2) African American or white; and (3) Native American or white. Note that running the independence test over the original data, including all 3 subsets, rejects the null hypothesis with high significance (P<0.00001). In setting the goal of (ε,δ)-differential privacy for MCIndepD, we kept δ=10⁻⁶ as in the original work,19 and varied ε from 1 to 0.01.
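MCIndepD itself is involved (it calibrates the test's null distribution via Monte Carlo), but the scale of the noise at small ε can be seen from the basic Laplace mechanism alone. The sketch below is a deliberately simplified stand-in, not the algorithm used in our experiments:

```python
import math
import random

random.seed(0)

def laplace_noise(scale):
    """Inverse-CDF sample from the Laplace(0, scale) distribution."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(true_count, epsilon):
    """epsilon-differentially private release of a counting query,
    which has sensitivity 1 under record addition/removal."""
    return true_count + laplace_noise(1.0 / epsilon)

# The noise standard deviation is sqrt(2)/epsilon: negligible at epsilon=1
# (~1.4) but ~141 at epsilon=0.01 -- on the order of the entire record
# count of a small subpopulation, whose contingency-table cells then
# become statistically hard to distinguish from noise.
for eps in (1.0, 0.1, 0.01):
    print(eps, round(math.sqrt(2) / eps, 1))
```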
RESULTS
Table 1 depicts the number of records for each race and, for principal diagnosis, age, and ZIP code, the PU, IE, and number of unique values within each race. Privacy disparity is readily observable, as both PU and IE suggest disproportional privacy risks for Asians and Native Americans.
Examination of Anonymization
Table 2 shows the results of anonymization. Although the suppression percentage for ZIP is relatively uniform, the suppression of race disproportionately affects Asians (11.08%). Similarly, although anonymization had little effect on whites (KL-divergence ~10⁻⁶ bit), its effect on Native Americans was significant (>2.5 bits). Given that the entropy of P_DIAG is only ~8 bits, the Texas procedure causes substantial utility loss for Asians and Native Americans.
Examination of Differential Privacy
Table 3 depicts the average P-value over 10 runs. Observe that, to achieve reasonable privacy (ε≤0.1), one cannot reject the null hypothesis for the subset of Asians and whites. Even for the entire dataset, when ε=0.01, the average P-value (0.009) makes the rejection questionable. This demonstrates that differential privacy can mask disparities that would otherwise be obvious.
DISCUSSIONS AND POLICY IMPLICATIONS
We illustrated the challenge of balancing privacy and health disparity research. Our results identified privacy disparity in public health data, and found that anonymization and differential privacy techniques may disproportionately suppress information for minorities or mask health disparities. Although the specific results were generated from a state-level dataset, the issues unveiled extend to datasets both larger (eg, nationwide databases) and smaller (eg, research datasets), even though their achievable tradeoffs between privacy and utility differ. For example, while a larger dataset naturally retains higher statistical power after anonymization, the retained power still differs across subpopulations, making signals that remain detectable for some populations undetectable for other, minority, populations and raising disparity concerns.
Our findings have implications for both technical and policy research. Technically, instead of offering a one-size-fits-all privacy guarantee to all individuals for all purposes, the research community needs to develop more flexible solutions that specifically enable health disparity research. One such possibility is to customize the privacy-preservation process to fit the utilization patterns of health disparity research, for example, in the case of differential privacy, by taking into account the importance of a statistic for disparity research when allocating the privacy budget ε.16
Policywise, regulators must carefully study the privacy implications before mandating the inclusion of attributes like race and geolocation in a released dataset. Although doing so could help identify health disparities, it also risks introducing privacy disparities, causing further harm to minorities. Conversely, mandating privacy could be problematic too, as the direct use of off-the-shelf anonymization tools may mask disparities that are otherwise identifiable. As in the technical case, it may be important to recognize the special needs of health disparity research and its inherent tension with privacy, and therefore allow certain practical tradeoffs instead of mandating one over the other. An example is to allow certain patients to specifically consent to the use of their data for health disparity research, while subjecting all other uses to more stringent privacy requirements. Although doing so introduces new challenges, for example, in ensuring that the specially permitted data are not used for any other purpose, it provides a potentially effective solution to the fundamental tension between enabling health disparity research and addressing privacy disparity.
Enabling health disparity research while avoiding privacy disparity is a complex challenge. It is our belief that addressing it requires extensive efforts in technical research and policy discussions, and our hope that this paper serves as a call for attention to the stakeholders and researchers.
1. Jha AK, Fisher ES, Li Z, et al. Racial trends in the use of major procedures among the elderly. N Engl J Med. 2005;353:683–691.
2. Trivedi AN, Zaslavsky AM, Schneider EC, et al. Trends in the quality of care and racial disparities in Medicare managed care. N Engl J Med. 2005;353:692–700.
3. Agency for Healthcare Research and Quality. 2016 National Healthcare Quality and Disparities Report. Rockville, MD. 2017. AHRQ Pub. No. 17-0001.
4. Nickens H. Report of the secretary’s task force on black and minority health: a summary and a presentation of health data with regard to blacks. J Natl Med Assoc. 1986;78:577–580.
5. Whitehead M. The concepts and principles of equity and health. Health Promotion International. 1991;6:217–228.
6. Braveman P, Gruskin S. Defining equity in health. J Epidemiol Community Health. 2003;57:254–258.
7. Zhang X, Pérez-Stable EJ, Bourne PE, et al. Big data science: opportunities and challenges to address minority health and health disparities in the 21st century. Ethn Dis. 2017;27:95–106.
8. Bilheimer LT, Klein RJ. Data and measurement issues in the analysis of health disparities. Health Serv Res. 2010;45(p2):1489–1507.
9. Agency for Healthcare Research and Quality. 2003 National Healthcare Quality and Disparities Report. Rockville, MD: Agency for Healthcare Research Quality; 2003.
10. Elliott MN, Fremont A, Morrison PA, et al. A new method for estimating race/ethnicity and associated disparities where administrative records lack self‐reported race/ethnicity. Health Serv Res. 2008;43(p1):1722–1736.
11. Sweeney L. k-anonymity: a model for protecting privacy. Int J Uncertainty, Fuzziness Knowl Based Syst. 2002;10:557–570.
12. Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008;4:e1000167.
13. Mittelstadt BD, Floridi L. The ethics of big data: current and foreseeable issues in biomedical contexts. In: Mittelstadt BD, Floridi L, eds. The Ethics of Biomedical Big Data. Cham, Switzerland: Springer International Publishing; 2016:445–480.
14. Manson NC, O’Neill O. Rethinking Informed Consent in Bioethics. Cambridge, UK: Cambridge University Press; 2007.
15. O’Neill L, Dexter F, Zhang N. The risks to patient privacy from publishing data from clinical anesthesia studies. Anesth Analg. 2016;122:2017–2027.
16. Dwork C. Differential privacy: a survey of results. In: Agrawal M, Du D, Duan Z, Li A, eds. International Conference on Theory and Applications of Models of Computation. Berlin, Heidelberg: Springer; 2008:1–19.
17. Madden M, Gilman M, Levy K, et al. Privacy, poverty, and big data: a matrix of vulnerabilities for poor Americans. Wash UL Rev. 2017;95:53–125.
19. Gaboardi M, Lim HW, Rogers RM, et al. Differentially private chi-squared hypothesis testing: goodness of fit and independence testing. In: Proceedings of the 33rd International Conference on Machine Learning (ICML’16). JMLR; 2016.
20. Pennsylvania Health Care Cost Containment Council, Pennsylvania Inpatient Data, Reporting Number 2001-10/01-04, 2001.
27. Dwork C, Roth A. The algorithmic foundations of differential privacy. Found Trends Theor Comp Sci. 2014;9:211–407.
28. Hardt M, Ligett K, McSherry F. A simple and practical algorithm for differentially private data release. In: Bartlett PL, ed. Advances in Neural Information Processing Systems. Lake Tahoe, NV: Neural Information Processing Systems Foundation Inc.; 2012:2339–2347.
29. Cover TM, Thomas JA. Elements of Information Theory. New York, NY: John Wiley & Sons; 2012.
30. Rizvi SJ, Haritsa JR. Maintaining data privacy in association rule mining. In: Proceedings of the 28th International Conference on Very Large Data Bases. VLDB Endowment; 2002:682–693.
31. Li N, Li T, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Trans Knowl Data Eng. 2010;22:943–956.
32. Kullback S, Leibler RA. On information and sufficiency. Ann Math Stat. 1951;22:79–86.