In a recent editorial, the editors of Epidemiology invited authors to make their analytic and simulation codes, questionnaires, and data used in analyses available to other researchers.1 Much has been written about the need for data sharing and reproducible research, and many journals and funding agencies have explicit data-sharing policies.2,3 When data are confidential, however, investigators typically cannot release them as collected, because doing so could reveal data subjects' identities or values of sensitive attributes, thereby violating ethical and potentially legal obligations to protect confidentiality.
At first glance, safely sharing confidential data seems a straightforward task: simply strip unique identifiers such as names, addresses, and identification numbers before releasing the data. However, these actions alone may not suffice when other identifying variables, such as geographic or demographic data, remain in the file. These quasi-identifiers can be used to match units in the released data to other databases. For example, Sweeney4 showed that 97% of the records in publicly available voter registration lists for Cambridge, MA, could be uniquely identified using birth date and 9-digit zip code. By matching the information in these lists, she was able to identify Massachusetts Governor William Weld in an anonymized medical database. As the amount of information readily available to the public continues to expand (eg, via the Internet and private companies), investigators releasing large-scale epidemiologic data run the risk of similar breaches.
In this commentary, we present a primer on techniques for sharing confidential data. We classify techniques into 2 broad classes: restricted access, in which access is provided only to trusted users, and restricted data, in which the original data are somehow altered before sharing.
Restricted-access methods share unaltered confidential data with external researchers while preserving confidentiality. If one or more researchers have been trusted to analyze confidential data, it seems logical that they should be able to share the data with additional trusted researchers. However, researchers may be prevented by law or by the policies of their institution or sponsoring agency from sharing the data, even with their colleagues. Without procedures and rules for sharing data, there is little an individual researcher can do.
There are 2 primary restricted-access methods employed by most data stewards, including government agencies and individual investigators: licensing agreements and restricted-data centers.5 Under the licensing model, researchers with legitimate research questions that can be answered with the confidential data enter into an agreement with the data steward and are provided with a copy of the data for use at their home institution. Typically, agreements specify how the data should be protected, and may require certification of destruction at project completion. Fees may be charged to cover data processing and administrative costs. In some cases, such as licensing agreements with the National Center for Education Statistics, the data are slightly altered to introduce a small amount of uncertainty, but not enough to concern analysts.
In the data-center model, researchers must carry out their research in a physically and electronically secure facility controlled by the data steward. Researchers are often required to submit a proposal describing the research plans, the confidential data needed, and in some cases how their research will benefit the data steward. In addition to travel expenses, researchers may be required to pay substantial fees to access the data center, and they cannot remove any output without approval from the data steward. The US Census Bureau and National Center for Health Statistics provide researchers access to confidential data in this manner.
An alternative is a secure online data center, such as the National Opinion Research Center's Virtual Data Enclave. Confidential data are stored on secure servers that approved researchers can access through remote login from prespecified IP addresses. The researcher can see and analyze the actual data, but the server disables local saving, printing of the data and analysis results, and cut-and-paste operations. As with a physical data enclave, analysis results are checked for potential disclosure violations by Center staff before being approved for publication outside the enclave. The virtual enclave has advantages over licensing models, in that the risks of breaches are centrally managed; for example, there are no misplaced CD ROMs or unapproved file sharing.
A variant on the virtual data enclave is the remote analysis server. This is a query-based system that provides the results of analyses of confidential data without actually allowing users to see the confidential values, thus reducing disclosure risks.6–8 Examples include the US Census Bureau's American FactFinder, which is available to anyone but produces only tabular summaries, or the Australian Remote Access Data Laboratory, which restricts users but allows more flexible analyses. Other query-based methods under research include verification servers,9 which provide users with feedback on the quality of analyses based on redacted public use data, and output perturbation through differential privacy,10 whereby noise is added to query results, so that they do not provide information about any individual with certainty.
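As a rough illustration of output perturbation, the sketch below answers a counting query by adding Laplace noise with scale 1/epsilon, the standard mechanism for counts under differential privacy. The function names, the toy records, and the value of epsilon are assumptions made for illustration; they do not describe any particular server.

```python
import random

def laplace_noise(scale, rng):
    """Draw from Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(records, predicate, epsilon, rng):
    """Return a count plus Laplace noise.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1 / epsilon (smaller epsilon means more noise).
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical toy data: 1 = has the condition, 0 = does not.
records = [{"diabetic": 1}] * 40 + [{"diabetic": 0}] * 60
rng = random.Random(2024)
answer = noisy_count(records, lambda r: r["diabetic"] == 1, epsilon=0.5, rng=rng)
print(round(answer, 1))  # the true count (40) plus random noise
```

Because the noise has mean zero, averages over many independent queries would converge on the true count, which is why practical systems also limit the number of queries a user may submit.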
Whereas restricted-access policies use trust and physical protection, restricted-data policies provide unfettered access to data that have been modified before release. The key issue for data stewards is determining how much alteration is needed to ensure protection without completely destroying the usefulness of the data. To make informed decisions about this trade-off, data stewards can quantify disclosure risk and data usefulness associated with particular release strategies. For example, when 2 competing policies result in approximately the same disclosure risk, the one with higher data usefulness should be selected. Quantifiable metrics can help data stewards decide when the risks are sufficiently low, and the usefulness is adequately high, to justify releasing the altered data.11,12
The literature on disclosure risk assessment highlights 2 main types of disclosures, namely (i) identification disclosures, which occur when an ill-intentioned user of the data (henceforth called an intruder) correctly identifies individual records in the released data, and (ii) attribute disclosures, which occur when an intruder learns the values of sensitive variables for individual records in the data. Attribute disclosures usually are preceded by identification disclosures (for example, when original values of attributes are released, intruders who correctly identify records learn the attribute values), so the focus typically is on identification-disclosure risk assessments. See the reports of the National Research Council13,14 and Federal Committee on Statistical Methodology15 for more information about attribute disclosure risks.16
Identification-disclosure risk measures are often based on estimates of the probabilities that individuals can be identified in the released data. Probabilities of identification are easily interpreted: the larger the probability, the greater the risk. Stewards determine their own threshold for unsafe probabilities. A variety of approaches have been used to estimate these probabilities.17–22 People who are unique in the population (as opposed to the sample) are particularly at risk.23 Therefore, much research has gone into estimating the probability that a sample unique record is in fact a population unique record.24,25
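A very crude version of such a probability can be sketched by counting how many released records share each combination of quasi-identifiers. The sketch assumes an intruder who knows the target is in the file and guesses uniformly among the matching records; the approaches cited above are considerably more refined, and the variable names and records here are hypothetical.

```python
from collections import Counter

def match_probabilities(records, keys):
    """Return 1 / (number of records sharing each record's key values)."""
    combos = [tuple(r[k] for k in keys) for r in records]
    sizes = Counter(combos)
    return [1.0 / sizes[c] for c in combos]

# Hypothetical released file with three quasi-identifiers.
records = [
    {"zip": "02138", "sex": "F", "birth_year": 1950},
    {"zip": "02138", "sex": "F", "birth_year": 1950},
    {"zip": "02138", "sex": "M", "birth_year": 1945},
    {"zip": "02139", "sex": "F", "birth_year": 1950},
]
probs = match_probabilities(records, ["zip", "sex", "birth_year"])
print(probs)  # [0.5, 0.5, 1.0, 1.0]

sample_uniques = sum(1 for p in probs if p == 1.0)
print(sample_uniques)  # 2
```

A value of 1.0 marks a sample unique. When the file is a sample rather than a census, that record need not be unique in the population, so this naive figure overstates the risk.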
Usefulness of the data is usually assessed with 2 general approaches: (1) comparing broad differences between the original and released data and (2) comparing differences in specific models between the original and released data. Broad difference measures essentially quantify some statistical distance between the distributions of the data on the original and released files, for example, a Kullback-Leibler or Hellinger distance.26 As the distance between the distributions grows, the overall quality of the released data generally drops.27,28
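For a single categorical variable, the two distances named above can be computed directly from the relative frequencies in each file. The following is a minimal sketch; the function names and the toy category counts are assumptions for illustration, and real assessments would typically compare multivariate distributions.

```python
import math
from collections import Counter

def empirical_dist(values, categories):
    """Relative frequency of each category, in a fixed order."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]

def kl_divergence(p, q):
    """Kullback-Leibler divergence: sum of p * log(p / q)."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue              # a term with p = 0 contributes nothing
        if qi == 0:
            return math.inf       # q rules out a category that p allows
        total += pi * math.log(pi / qi)
    return total

def hellinger(p, q):
    """Hellinger distance: 0 for identical distributions, 1 for disjoint ones."""
    s = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
    return math.sqrt(s / 2.0)

# Hypothetical original and released values of one categorical variable.
categories = ["a", "b", "c"]
original = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
released = ["a"] * 45 + ["b"] * 35 + ["c"] * 20

p = empirical_dist(original, categories)
q = empirical_dist(released, categories)
print(kl_divergence(p, q), hellinger(p, q))
```

The Kullback-Leibler divergence is not symmetric, so the steward has to decide which file plays the role of the reference distribution; the Hellinger distance avoids that choice and is bounded.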
Comparison of measures based on specific models is often performed informally. For example, the steward looks at the similarity of point estimates and standard errors of regression coefficients after fitting the same regression on the original data and on the data proposed for release. If the results are considered close (eg, if the confidence intervals obtained from the models largely overlap), the released data have high utility for that particular analysis.29 Such measures are closely tied to how the data are used, but they provide a limited representation of the overall quality of the released data. Thus, it is prudent to examine models that represent a wide range of uses of the released data.
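One way to make "largely overlap" concrete is to score the intersection of the two confidence intervals relative to the length of each. The sketch below is one such score under stated assumptions: it uses 95% Wald intervals, it sets the score to 0 when the intervals are disjoint, and the coefficient values are invented for illustration.

```python
def wald_interval(estimate, std_error, z=1.96):
    """Approximate 95% confidence interval from a point estimate and standard error."""
    return (estimate - z * std_error, estimate + z * std_error)

def interval_overlap(orig, released):
    """Average share of each interval that is covered by their intersection.

    Returns 1.0 for identical intervals and 0.0 when they do not overlap.
    """
    lo = max(orig[0], released[0])
    hi = min(orig[1], released[1])
    shared = max(0.0, hi - lo)
    return 0.5 * (shared / (orig[1] - orig[0]) + shared / (released[1] - released[0]))

print(interval_overlap((0.0, 2.0), (1.0, 3.0)))  # 0.5

# Hypothetical coefficient for the same regression fit to each file.
ci_original = wald_interval(1.20, 0.10)
ci_released = wald_interval(1.25, 0.12)
print(round(interval_overlap(ci_original, ci_released), 2))
```

Computing the score for every coefficient in several candidate models gives a compact summary, in keeping with the advice to examine a wide range of uses.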
Risks can be reduced by altering data, using one or more statistical-disclosure-limitation strategies. Common approaches include coarsening, data swapping, noise addition, and synthetic data. Examples of coarsening include releasing geography as aggregated regions, reporting exact values only below or above certain thresholds (known as top or bottom coding), collapsing levels of categorical variables, and rounding numerical values. Coarsening, also called recoding, reduces disclosure risks by turning atypical records, which generally are most at risk, into typical records. Recodes frequently are specified to reduce estimated probabilities of identification to acceptable levels (eg, no cell in a cross-tabulation of a set of key categorical variables has fewer than k individuals, where k is often 3 or 5).
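The recodes described above are simple to express in code. The sketch below top-codes age, collapses it into 10-year bands, aggregates a county code to its state prefix, and then checks the smallest cell against a threshold k. The field names, thresholds, and records are hypothetical, and ages are assumed to be integers.

```python
from collections import Counter

def top_code(value, threshold):
    """Report values above the threshold as the threshold itself."""
    return min(value, threshold)

def band(value, width):
    """Collapse an integer into a band such as '40-49'."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"

def coarsen(record):
    """Apply the recodes to one record (hypothetical fields 'age' and 'county')."""
    return {
        "age": band(top_code(record["age"], 90), 10),
        "county": record["county"][:2],  # keep only the 2-digit state prefix
    }

def smallest_cell(records, keys):
    """Size of the smallest cell in the cross-tabulation of the key variables."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())

records = [
    {"age": 34, "county": "37063"},
    {"age": 37, "county": "37135"},
    {"age": 38, "county": "37183"},
    {"age": 41, "county": "37063"},
    {"age": 45, "county": "37135"},
    {"age": 48, "county": "37183"},
]
recoded = [coarsen(r) for r in records]

k = 3
print(smallest_cell(records, ["age", "county"]))       # 1
print(smallest_cell(recoded, ["age", "county"]) >= k)  # True
```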
Data swapping refers to switching the data values for selected records with those for other records.30 The protection afforded by swapping is based in large part on perception: data stewards expect that intruders will be discouraged from linking to external files, as any apparent matches could be incorrect due to the swapping. The details of the swapping mechanism are almost always kept secret to reduce the potential for reverse engineering.
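Because real swapping mechanisms are kept secret and are usually targeted at high-risk records, the following can only be a generic sketch: it pairs records at random and exchanges one variable between them. The function name, the swap fraction, and the toy records are assumptions for illustration.

```python
import random

def swap_values(records, field, fraction, rng):
    """Swap `field` between randomly paired records; about `fraction` of records are touched."""
    swapped = [dict(r) for r in records]  # copy so the originals stay untouched
    n_pairs = int(len(swapped) * fraction / 2)
    chosen = rng.sample(range(len(swapped)), 2 * n_pairs)
    for i, j in zip(chosen[::2], chosen[1::2]):
        swapped[i][field], swapped[j][field] = swapped[j][field], swapped[i][field]
    return swapped

# Hypothetical file: swap county among 40% of the records.
records = [{"county": f"C{i % 5}", "income": 20_000 + 1_000 * i} for i in range(10)]
released = swap_values(records, "county", fraction=0.4, rng=random.Random(11))

changed = sum(1 for a, b in zip(records, released) if a["county"] != b["county"])
print(changed)  # at most 4; fewer if paired records already shared a county
```

Swapping leaves the marginal distribution of the swapped variable intact but disturbs its relationships with the other variables, which is the price paid for the added uncertainty.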
Adding random noise to sensitive or identifying data values can reduce disclosure risks because errors are introduced into the released data, making it difficult for intruders to match on the released values with confidence.31,32 For numerical data, typically, the noise is generated from a distribution with a mean of zero to ensure point estimates of means remain unbiased, although the variance associated with those estimated means may increase. However, zero-mean noise still can lead to attenuation in point estimates of correlations and regression coefficients.
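The attenuation follows from a short calculation. The sketch below assumes, beyond what the text states, that the noise e is independent of both the true value X and an outcome Y and has variance sigma_e squared; beta denotes the slope from a simple linear regression of Y on X. The symbols are introduced here for illustration.

```latex
\begin{align*}
X^{*} &= X + e, \qquad E(e) = 0, \quad \operatorname{Var}(e) = \sigma_e^{2} \\
E(X^{*}) &= E(X), \qquad \operatorname{Var}(X^{*}) = \sigma_X^{2} + \sigma_e^{2} \\
\operatorname{Cov}(X^{*}, Y) &= \operatorname{Cov}(X, Y) \\
\beta^{*} = \frac{\operatorname{Cov}(X^{*}, Y)}{\operatorname{Var}(X^{*})}
  &= \beta \, \frac{\sigma_X^{2}}{\sigma_X^{2} + \sigma_e^{2}} \\
\operatorname{Corr}(X^{*}, Y)
  &= \operatorname{Corr}(X, Y) \, \sqrt{\frac{\sigma_X^{2}}{\sigma_X^{2} + \sigma_e^{2}}}
\end{align*}
```

The mean is unchanged, but the slope and the correlation are both shrunk toward zero by a factor that depends on the ratio of the true variance to the total variance. For example, noise with the same variance as X halves the estimated slope.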
Partially synthetic data methods replace some collected values, such as sensitive values at high risk of disclosure or values of key identifiers, with multiple imputations.8,33–35 For example, suppose that the steward wants to replace income when it exceeds $100,000—because the steward considers only these individuals to be at risk for disclosure—and is willing to release all other values as collected. (Alternatively, the steward could synthesize the key identifiers for persons with incomes exceeding $100,000 with the goal of reducing risks that these individuals might be identified.) The steward generates replacement values for the incomes over $100,000 by randomly simulating from the distribution of income conditional on all other variables. To avoid bias, this distribution must be conditional on income exceeding $100,000. The distribution is estimated using the collected data and possibly other relevant information. This yields one synthetic dataset. The steward repeats this process multiple times and releases the multiple datasets to the public. The multiple datasets enable secondary analysts to reflect the uncertainty from simulating replacement values in inferences. Among the restricted-data techniques described here, partial synthesis is the newest and least commonly used.
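The income example can be sketched as follows. This is a deliberately simplified illustration, not the procedure used in any released product: it assumes a single hypothetical predictor (years of education), a log-normal regression model for income, strictly positive incomes, and plug-in parameter estimates. A fuller implementation would also draw the model parameters from their posterior distribution before generating each dataset.

```python
import math
import random

THRESHOLD = 100_000  # incomes above this are replaced, as in the example above

def fit_line(x, y):
    """Least-squares fit of y on x; returns intercept, slope, and residual SD."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, math.sqrt(rss / (n - 2))

def draw_above(mu, sigma, rng):
    """Draw an income from the fitted model, conditional on exceeding THRESHOLD.

    Plain rejection sampling: simple, but slow when few draws exceed the threshold.
    """
    while True:
        income = math.exp(rng.gauss(mu, sigma))
        if income > THRESHOLD:
            return income

def synthesize(records, m, rng):
    """Return m partially synthetic copies; only incomes above THRESHOLD change."""
    x = [r["education"] for r in records]
    y = [math.log(r["income"]) for r in records]
    intercept, slope, sigma = fit_line(x, y)  # estimated from the collected data
    datasets = []
    for _ in range(m):
        copy = [dict(r) for r in records]
        for r in copy:
            if r["income"] > THRESHOLD:
                mu = intercept + slope * r["education"]
                r["income"] = draw_above(mu, sigma, rng)
        datasets.append(copy)
    return datasets

# Hypothetical collected data.
rng = random.Random(7)
records = []
for _ in range(200):
    edu = rng.randint(8, 20)
    income = math.exp(10.5 + 0.1 * edu + rng.gauss(0, 0.4))
    records.append({"education": edu, "income": income})

released = synthesize(records, m=5, rng=rng)
print(len(released), len(released[0]))  # 5 200
```

Every record with income at or below the threshold is identical across the five released files, while each high-income record carries a different simulated value in each file. That variation across files is what lets secondary analysts account for the uncertainty introduced by synthesis.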
A stronger version of synthetic data is full synthesis, in which all values in the released data are simulated.36,37 This approach may become appealing in the future, if confidentiality concerns grow to the point where no original data can be released in unrestricted public use files.
To reproduce research conducted on confidential data, researchers will likely need access to the same confidential data used by the original investigators. It is possible in some cases for restricted-data methods to be used, but for the purpose of reproducibility, restricted-access methods hold more promise. This is not to say that restricted-data methods do not have their place, as they enable many other benefits of data sharing.
For both restricted data and restricted access, the methods that we describe may be too cumbersome for individual researchers to implement independently. We believe it is incumbent upon research institutions and sponsors, journals, and other advocates of reproducibility to facilitate sharing of confidential data by establishing procedures and policies for sharing confidential data and perhaps creating confidential data repositories. Authors who obtain their confidential data from such repositories could comply with a data-sharing invitation such as the one put forth by Epidemiology by providing detailed instructions for obtaining the confidential data.
ABOUT THE AUTHORS
JEROME REITER is the Mrs. Alexander Hehmeyer Associate Professor of Statistical Science at Duke University and the current chair of the American Statistical Association Committee on Privacy and Confidentiality. SATKARTAR KINNEY is a research scientist at the National Institute of Statistical Sciences. Their research focuses on methodology for protecting confidential data. They have worked closely with the Census Bureau to develop a public-use file for the Longitudinal Business Database, the first ever public-use, establishment-level data product in the United States.
1. Hernan MA, Wilcox AJ. Epidemiology, data sharing, and the challenge of scientific replication [commentary]. Epidemiology
2. Donoho DL. An invitation to reproducible computational research. Biostatistics
3. Sedransk N, Young LJ, Kelner KL, et al. Make research data public? – Not always so simple: A dialogue for statisticians and science editors. Statist Sci
4. Sweeney L. Computational Disclosure Control: Theory and Practice [doctoral thesis]. Cambridge, MA: Department of Computer Science, Massachusetts Institute of Technology; 2001.
5. Kinney SK, Karr AF, Gonzalez JF. Data confidentiality: the next five years summary and guide to papers. JPC
6. Gomatam S, Karr AF, Reiter JP, Sanil AP. Data dissemination and disclosure limitation in a world without microdata: a risk-utility framework for remote access analysis servers. Statist Sci
7. Reiter JP. Model diagnostics for remote-access regression servers. Statist Comput
8. Reiter JP. New approaches to data dissemination: a glimpse into the future (?). Chance
9. Reiter JP, Oganian A, Karr AF. Verification servers: enabling analysts to assess the quality of inferences from public use data. Comput Stat Data Anal
10. Dwork C. Differential privacy. In: The 33rd International Colloquium on Automata, Languages, and Programming, Part II. Berlin: Springer; 2006:1–12.
11. Willenborg L, de Waal T. Elements of Statistical Disclosure Control. New York: Springer-Verlag; 2001.
12. Duncan GT, Keller-McNulty SA, Stokes SL. Disclosure risk vs. data utility: the R-U confidentiality map. Research Triangle Park, NC: US National Institute of Statistical Sciences; 2001. Technical Report.
13. National Research Council. Expanding Access to Research Data: Reconciling Risks and Opportunities. Panel on Data Access for Research Purposes, Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press; 2005.
14. National Research Council. Putting People on the Map: Protecting Confidentiality With Linked Social-Spatial Data. Panel on Confidentiality Issues Arising from the Integration of Remotely Sensed and Self-Identifying Data, Committee on the Human Dimensions of Global Change, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press; 2007.
15. Federal Committee on Statistical Methodology. Statistical Policy Working Paper 22: Report on Statistical Disclosure Limitation Methodology (2nd version). Washington, DC: Confidentiality and Data Access Committee, Office of Management and Budget, Executive Office of the President; 2005.
16. Lambert D. Measures of disclosure risk and harm. J Off Stat
17. Duncan GT, Lambert D. Disclosure-limited data dissemination. J Am Stat Assoc
18. Duncan GT, Lambert D. The risk of disclosure for microdata. J Bus Econ Statist
19. Fienberg S, Makov UE, Sanil AP. A Bayesian approach to data disclosure: optimal intruder behavior for continuous data. J Off Stat
20. Domingo-Ferrer J, Torra V. Disclosure risk assessment in statistical microdata via advanced record linkage. Statist Comput
21. Reiter JP. Estimating risks of identification disclosure for microdata. J Am Stat Assoc
22. Shlomo N, Skinner CJ. Assessing the protection provided by misclassification-based disclosure limitation methods for survey microdata. Ann Appl Stat
23. Bethlehem JG, Keller WJ, Pannekoek J. Disclosure control of microdata. J Am Stat Assoc
24. Elamir E, Skinner CJ. Record-level measures of disclosure risk for survey microdata. J Off Stat
25. Skinner CJ, Shlomo N. Assessing identification risk in survey microdata using log-linear models. J Am Stat Assoc
26. Shlomo N. Statistical disclosure control for census frequency tables. Int Stat Rev
27. Domingo-Ferrer J, Torra V. A quantitative comparison of disclosure control methods for microdata. In: Doyle P, Lane J, Zayatz L, Theeuwes J, eds. Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies. Amsterdam: North-Holland; 2001:111–133.
28. Woo M, Reiter JP, Oganian A, Karr AF. Global measures of data utility for microdata masked for disclosure limitation. JPC
29. Karr AF, Kohnen CN, Oganian A, Reiter JP, Sanil AP. A framework for evaluating the utility of data altered to protect confidentiality. Am Stat
30. Dalenius T, Reiss SP. Data-swapping: a technique for disclosure control. J Stat Plan Inference
31. Brand R. Micro-data protection through noise addition. In: Domingo-Ferrer J, ed. Inference Control in Statistical Databases. New York: Springer; 2002:97–116.
32. Kim JJ. A method for limiting disclosure in microdata based on random noise and transformation. In: Proceedings of the Section on Survey Research Methods of the American Statistical Association; 1986:370–374.
33. Little RJ. Statistical analysis of masked data. J Off Stat
34. Reiter JP, Raghunathan TE. The multiple adaptations of multiple imputation. J Am Stat Assoc
35. Rubin DB. Statistical disclosure limitation. J Off Stat
36. Raghunathan TE, Reiter JP, Rubin DB. Multiple imputation for statistical disclosure limitation. J Off Stat
37. Reiter JP. Releasing multiply-imputed, synthetic public use microdata: an illustration and empirical study. J R Stat Soc Ser A Stat Soc