Researchers are increasingly being called upon to share their data. For example, all large grant proposals submitted to the National Institutes of Health (NIH) must contain a data-sharing plan.a The plan must either state how the investigators will make their data publicly available or specify their reasons for not doing so. Sharing data and software code with other researchers is perceived as a way of promoting scientific inquiry, deterring misconduct, and providing additional validation of scientific findings.b
Some journals, such as the International Journal of Forecasting, recently changed their editorial policy to require contributors to provide whatever material is needed (e.g., data, computer code) to allow results to be replicated by other researchers.1 However, many of this journal’s studies pertain to economic or organizational data (e.g., quarterly sales of automobiles). For such articles, the issue of protecting individual privacy or obtaining IRB approval does not arise. By contrast, the stealthy dissemination of a person’s medical history may cause significant harm. For example, employers could use this information for hiring and promotion decisions, and banks could use it to estimate which borrowers are likely to default on their mortgages.c
In this article, we consider the privacy implications of posting data from small, randomized trials, observational studies, or case series in anesthesia from a few (e.g., 1–3) centers involving 4 to 40 patients per group. Examples of such studies would include 40 patients randomized into 2 groups for a pharmacokinetic analysis; 40 patients in a registry with malignant hyperthermia; or 40 patients having magnetic resonance imaging under general anesthesia at night. Such studies are common in anesthesia. With respect to larger studies involving thousands of patients, there are multiple additional issues to be considered, because such studies often use data based on secondary agreements (e.g., from a state). The 4 scenarios in Table 1 include 2 realistic examples of privacy breaches in the context of an anesthesia journal and 2 examples of persons being re-identified from anonymized data.
There has been much work in recent years on how to enable the publication of a dataset without violating the privacy of individual data owners (e.g., patients).b One cannot expect to achieve both perfect privacy and perfect data utility, for maximizing one objective will come at the expense of the other. At one extreme, one can suppress all the information and achieve perfect privacy but no utility. Alternatively, one can publish the entire database without editing or controls, delivering maximum utility and maximum risk of privacy disclosure. The essence of privacy-preserving data publishing is then to discern how best to make this trade-off (i.e., which segments of the data to disclose and which to hide).
In Section 2, we present the ubiquitous “anonymization framework,” which is the foundation for almost all modern-day privacy-preserving techniques. Table 2 provides definitions of terms in this article. Section 3 describes methods of attack to re-identify individuals from anonymized data. Section 4 outlines methods of defense to protect against such attacks. Section 5 focuses on the legal aspects of health privacy, which are governed within the United States by the Health Insurance Portability and Accountability Act (HIPAA). Section 6 presents a case study of a “simulated attack” that illustrates the methods an adversary might use to link anesthesia data from small, clinical trials to the Texas inpatient database. Section 7 proposes an editorial policy that seeks to find a third alternative between 2 absolutes: “perfect privacy” versus “complete transparency.” Appendix A contains details about logistic regression and other statistical techniques, explaining why it is not possible to both exclude some variables (e.g., hospital and surgical procedures) and have the resulting dataset sufficient for replicating the authors’ analysis.
2. ANONYMIZATION FRAMEWORK
The anonymization framework, as defined in Table 2, is the foundation of all or most privacy-preserving methods in use today.d Governments, journals, and other producers and consumers of “big data” rely on anonymization, also known as “de-identification,” to protect consumer privacy. Anonymization is also the basis for various legal and regulatory frameworks that govern the use and exchange of health information (e.g., “HIPAA” in the United States and the “Data Protection Directive” in the European Union). Prior to making the data available as secondary digital content, researchers and journal editors will use some protocol to anonymize the data.
The major premise of anonymization is that all attributes in a given database fall into 2 categories: “personally identifiable” and “not personally identifiable” (Table 2). “Personally identifiable” information is defined as information that can be used alone or in combination to identify an individual.e,f Examples of such information include name, social security number, and telephone number. Prior to the data being posted as supplemental digital content, all personally identifiable information is redacted, modified, or generalized in some way, such as by truncating the last 2 digits of a 5-digit postal code.
Sometimes, attributes that are not personally identifiable on their own can be used in combination to identify an individual, a process known as “re-identification” (Table 2). For example, knowing a person’s date of birth is insufficient, by itself, to identify that individual. Yet, 87% of the specific combinations of date of birth, postal code, and gender occur only once among the entire US population.2 The risk of re-identification is inversely proportional to the size of the smallest subgroup that shares a given set of attributes (e.g., there is a 5% [= 1/20] chance of identifying someone from a group with 20 members). Thus, the “population uniqueness” (e.g., 87% or 5%) is a widely used heuristic to assess the vulnerability of a given dataset to re-identification.
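The heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy records, the field names, and the helper `reidentification_risk` are hypothetical and are not taken from any study. The sketch groups records by a chosen set of attributes, reports the worst-case risk as 1 divided by the size of the smallest group, and also reports the fraction of records whose combination of attributes is unique.

```python
from collections import Counter

def reidentification_risk(records, attributes):
    """Return (worst-case risk, fraction of records that are unique).

    Worst-case risk is 1 / (size of the smallest group of records that
    share the same values for `attributes`).
    """
    keys = [tuple(r[a] for a in attributes) for r in records]
    group_sizes = Counter(keys)
    smallest = min(group_sizes.values())
    unique_fraction = sum(1 for k in keys if group_sizes[k] == 1) / len(keys)
    return 1 / smallest, unique_fraction

# Hypothetical toy data, invented for illustration only.
records = [
    {"birth_year": 1950, "postal3": "522", "sex": "F"},
    {"birth_year": 1950, "postal3": "522", "sex": "F"},
    {"birth_year": 1950, "postal3": "522", "sex": "M"},
    {"birth_year": 1962, "postal3": "522", "sex": "M"},
]
risk, unique_fraction = reidentification_risk(
    records, ["birth_year", "postal3", "sex"]
)
print(risk, unique_fraction)
```

In this toy dataset, two of the four records have a unique combination of attributes, so the smallest group has one member and the worst-case risk is 1. A group of 20 indistinguishable records would give a risk of 1/20, matching the 5% example in the text.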
The main limitation of anonymization lies in determining which attributes are “personally identifiable,” as there is no statistical test or objective standard to guide this process. Thus, the definition of “personally identifiable” information is based on cultural norms and legal precedents that are specific to a given country or jurisdiction. Within the United States, for example, the conventional wisdom on which attributes are “identifiable” is specified in the “safe harbor provision” of the HIPAA law, hereafter referred to as “safe harbor.” Safe harbor defines 18 specific attributes as “protected health information” (PHI) (e.g., name, cell phone number, and medical record number).g Thus, to emphasize, there is no clear definition for personally identifiable information,f and PHI is not a synonym.
One problem with defining a fixed set of attributes as PHI is that information that is not considered PHI today may be reclassified as personally identifiable information by future researchers. For example, even though data from anesthesia studies generally do not include postal codes, an adversary could infer such personally identifiable information from other information. In Appendix A, we explain why the hospital where the patient receives care needs to be included when the attribute significantly influences results, directly or indirectly. If a study includes 3 rural hospitals that are physically distant from one another, and a patient went to one of those hospitals, then there is a high likelihood that an adversary could infer the patient’s postal code from the hospital. In theory, any information about a person can be used to identify that individual.
In theory, HIPAA was intended to protect the privacy of all health information. However, by defining a specific set of attributes as “PHI,” the safe harbor method has effectively defined a “second class” of health information, which we call “unprotected health information.”f Examples of unprotected health information include hospital name, diagnosis codes (e.g., drug addiction), month or quarter of a year when surgery was performed, and all procedure codes (Appendix A). The focus of our article is not on the risks of publishing PHI, since these risks are well known and have been addressed by HIPAA within the United States and other rules outside the United States (see Section 5). Rather, our focus is on the privacy risks that stem from unprotected health information, since it is generally assumed that the publication or exchange of such data poses minimal privacy risks. We shall demonstrate in Section 6 that this is not so. Whereas the strict definition of “PHI” (e.g., how many attributes to “protect”) will vary by region (e.g., the Data Protection Directive within the EU), all of these privacy laws share one common trait: a reliance on the anonymization framework.
Compliance with the safe harbor standard does not protect against all types of privacy attacks and is no guarantee that no privacy breach will occur.3 Researchers have demonstrated this by re-identifying individuals from a range of databases that were de-identified according to various protocols.4 The threshold level of risk of re-identification often used by national organizations (e.g., by IRBs) is no more than 4 out of 10,000.3 This standard of 0.04% is based on “population uniqueness,” which is the likelihood that a particular combination of attributes is unique to one individual. To interpret 4 out of 10,000, consider that approximately 0.04% of the US population has a unique combination of age in years, gender, and the first 3 digits of their 5-digit postal code.h While 4 out of 10,000 may sound like a stringent standard, it could still pose an unacceptable level of risk. For example, a uniform risk of 0.04% implies that, for a state database of 2.8 million hospital records, about 1120 persons are at risk of having their health information compromised. Yet, this threshold of 0.04% may represent a best-case scenario. We show below that, in practice, a greater percentage of patients may be identifiable.
Intuitively, we may think that releasing a small dataset entails a “small risk” of privacy disclosure. However, this does not generally hold, especially if an adversary knows someone who participated in the study. For example, at a dinner party, one person says that he is going to have knee replacement surgery and was told that there would be a nerve block to reduce pain. A colleague says that while having that surgery at the same hospital, he was in a study of nerve blocks. Someone at the party knows the year of surgery. There was just one study published from that hospital involving regional anesthesia for joint arthroplasty during that year. With 40 patients randomized to each of 2 groups, the supplemental digital content has vastly exceeded the generally accepted risk of re-identification (i.e., 1/40 instead of 0.04%). Since an adversary would also know the colleague’s sex, age category (decade), race, and ethnicity, the risk of re-identification would likely be even greater (e.g., 50% rather than 1/40). Thus, even when “hospital” is not a field in the data, for small anesthesia studies, it is effectively reported based on the IRB declaration and author affiliation.
3. ADVERSARIES AND METHODS OF ATTACK
A central concept in data privacy is that of the “adversary.” Given the privacy scenarios in Table 1, adversaries may include, for example, an estranged spouse, a reporter from a local newspaper, or an attorney who is looking for potential clients. The adversary is assumed to be indifferent to privacy laws, data use agreements, legal sanctions, and pangs of conscience.
Defending a database against attack is inherently more difficult than performing a successful attack. Whereas the defender must succeed every time, the attacker has to succeed only once for a privacy breach to occur. Although the probability of matching an individual record might be quite small (e.g., 1 in 10,000), a large database may contain thousands or even millions of records, making the likelihood of at least one “success” a virtual certainty. Hence, risk analysis is a study of “worst-case scenarios.” Within the context of anesthesia data, the “worst-case” is that the adversary: (a) knows the patient (e.g., the dinner party of Section 2); (b) will search the Internet for public records (e.g., real estate purchases, thus providing geographic location); and (c) has access to “semi-public” databases (e.g., state inpatient data, which can be purchased for a few hundred dollars). If any one of these assumptions does not hold, then the likelihood that the adversary could succeed is reduced. The primary risk of publishing small datasets stems from the likelihood that records can be linked to large databases (e.g., Google search providing a picture of the person with a caption essentially revealing sex, race, age within a decade, and city of residence).
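The claim that a small per-record probability becomes a near certainty over a large database can be checked with a short calculation. The sketch below assumes, purely for illustration, that each record is matched independently with the same probability p; the function name is hypothetical, and real matching probabilities are neither uniform nor independent.

```python
def prob_at_least_one_match(p, n):
    """Probability of at least one successful match among n records,
    assuming each record is matched independently with probability p."""
    return 1 - (1 - p) ** n

# p = 1 in 10,000, as in the example in the text.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, round(prob_at_least_one_match(1e-4, n), 4))
```

Under these assumptions, the probability of at least one match already exceeds 60% for a database of 10,000 records and is indistinguishable from 1 for a database of 1 million records.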
The first step of most privacy attacks is to link 2 or more databases based on overlapping attributes. For example, suppose a hospital database contains the patient’s date of birth, postal code, and gender. A voter registration database includes these same attributes, along with the voter’s name and address. The voter registration database is public.i One seminal study found that 87% of the US population was identifiable through the combination of date of birth, postal code, and gender.2 By linking the voter registration list to the health care database, the researchers were able to re-identify the medical records of William Weld, the former Governor of Massachusetts.j For purposes of patient privacy, what matters is not the data itself to be provided in supplemental digital content, but how the data can be used along with other publicly available information.
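A linkage of this kind amounts to a join on the overlapping attributes. The following minimal sketch uses invented records and a hypothetical helper `link`; it is not the procedure used in the cited study. It reports a match only when exactly one person in the public list shares the released record’s date of birth, postal code, and sex.

```python
def link(released, public, keys):
    """Join two record lists on shared attributes, keeping only the
    released records that match exactly one person in the public list."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)
    matches = []
    for rec in released:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], rec["diagnosis"]))
    return matches

# Hypothetical toy data: all names and values are invented.
voters = [
    {"name": "Person A", "birth_date": "1950-01-15", "postal_code": "52240", "sex": "M"},
    {"name": "Person B", "birth_date": "1950-01-15", "postal_code": "52240", "sex": "F"},
    {"name": "Person C", "birth_date": "1980-06-30", "postal_code": "52241", "sex": "F"},
    {"name": "Person D", "birth_date": "1980-06-30", "postal_code": "52241", "sex": "F"},
]
hospital = [
    {"birth_date": "1950-01-15", "postal_code": "52240", "sex": "M", "diagnosis": "diagnosis X"},
    {"birth_date": "1980-06-30", "postal_code": "52241", "sex": "F", "diagnosis": "diagnosis Y"},
]
matches = link(hospital, voters, ["birth_date", "postal_code", "sex"])
print(matches)
```

The first hospital record is re-identified because only one voter shares its three attributes. The second is not, because two voters are indistinguishable on those attributes.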
We refer several times in our article to Netflix’s release of customers’ rankings of movies (Table 1). The database consisted of >100 million movie ratings of 17,700 movies by >480,000 consumers.k The total number of “cells” in this matrix exceeded 8.4 billion (i.e., movies multiplied by consumers). Hence, about 99% of the cells were zero; all relevant information was contained in only 1% of the cells. This is because even an avid movie-watcher sees only a small fraction of all possible movies. This is known in computer science as the problem of “sparsity.” Owing to this sparsity, the Netflix data were vulnerable to re-identification.
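The sparsity figures above follow from simple arithmetic. The sketch below uses the lower bounds quoted in the text (the text states “>100 million” ratings and “>480,000” consumers), so the results are approximations rather than exact Netflix figures.

```python
# Lower bounds quoted in the text, used here as approximate values.
movies = 17_700
consumers = 480_000
ratings = 100_000_000

cells = movies * consumers   # size of the full movie-by-consumer matrix
density = ratings / cells    # share of cells that actually hold a rating
print(cells, round(100 * density, 2), round(100 * (1 - density), 2))
```

With these values the matrix has roughly 8.5 billion cells, of which a little over 1% hold a rating, consistent with the description of about 99% empty cells.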
In terms of sparsity, health care databases are similar. For example, the 2013 inpatient database from the state of Texas includes 255 distinct attributes for every inpatient stay, including up to 24 diagnosis codes and 24 procedure codes.l Under the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) rubric, there are approximately 3800 procedure codes and 14,000 diagnosis codes.m Thus, the number of possible combinations of procedure codes and diagnosis codes is, both conceptually and in actuality, virtually limitless.6 Even a patient having multiple procedures would have only a minuscule fraction of all possible procedures (e.g., 4 out of 3800).
Analogous to the Netflix example, data to be published from an anesthesia study could be nothing but the hospital and list of surgical procedures. Because there are thousands of procedure categories, this information can be sufficient (by itself) to match particular records to the state database. We test this proposition in Section 6.
For anesthesia studies, the variables most likely to result in identification of individuals are the combination of hospital and surgical procedure(s). The former is described below. The latter is because many cases at hospitals are of uncommon combinations of procedures.6–9 Esophagogastroduodenoscopy is an example of a common procedure.7 Anoplasty and anorectal myomectomy are examples of uncommon procedures.8,9 There are thousands of different procedure codes and combinations. Twenty percent (SE 1%) of outpatient surgery cases performed in the United States from 1994 to 1996 were procedures performed annually no more than 1000 times nationwide. Compared to the few very common procedures,6,8 these many uncommon procedures (i.e., those of the median incidence) are each performed about 100 times less frequently.6,8
For example, within the state of Iowa, more than two-thirds of all rare physiologically complex pediatric surgery was performed at one hospital.10–12 Suppose that a case series describes the anesthetic approach and outcomes for 20 children who underwent a rare procedure at the University of Iowa. Then, if an adversary knows from secondary information that a child from Iowa underwent surgery for that rare procedure, there would be no less than a 1 in 30 (= [2/3] × [1/20]) risk of re-identification. This is vastly greater than the 4 out of 10,000 accepted standard for population uniqueness.
Hospital and procedure cannot generally be excluded from anesthesia studies (Appendix A). Thus, an adversary who knows all of a patient’s procedure codes could narrow down the group of potential matches to one, or at most a few, patients. If the data provided by the authors could include any hospital in a state or province, identifying a specific patient would be much more difficult. However, in most studies with few patients (n ≤ 30), all patients come from the same hospital. Even without stating the hospital name in the article, it can often be inferred from the authors’ institutional affiliation, as well as the institution granting IRB approval. The specific hospital is then combined with the specific procedure(s), and the remaining data in the state database for that patient are known (e.g., all current and past diagnoses).
4. METHODS OF DEFENSE
In this section, we provide a brief overview of various defensive measures and their relevance to the publication of data from clinical research. A commonly accepted trade-off is to disclose aggregate views of the underlying data (e.g., SUM, COUNT, MIN, MAX) over all or part of the dataset, while hiding as much information about the individual records as possible. The rationale is 2-fold. First, aggregates are sufficient for many data analytic applications (e.g., statistical analysis, data mining, and machine learning). So long as aggregate views can be computed accurately from the underlying dataset, data scientists may not require access to individual records to make statistical inferences. Second, aggregates are considered “safer” (at least intuitively) from a privacy standpoint. For example, while disclosing the age of a patient could be viewed as a privacy violation, publishing a histogram of the age distribution of a community (e.g., postal code) is usually deemed harmless.
Given the objective of enabling aggregate query processing while reducing individual privacy disclosure, numerous privacy protection frameworks have been proposed and studied. In what follows, we review 2 general approaches: query auditing and data perturbation/generalization. We also consider the practical implications of each technique as it relates to the publishing of anesthesia data.
a. Query Auditing
Query auditing is the process by which data queries are checked to ensure that they cannot be used to discover confidential information. One straightforward method of query auditing is to allow aggregate queries while disallowing all other types of queries. To illustrate, a researcher has access to the database only indirectly, by submitting queries (i.e., data requests) to the database owner. If the question is aggregate in nature (e.g., “How many patients underwent a radical prostatectomy?”), then the database owner responds by answering the question, as shown in Figure 1. If the question pertains to a single patient (e.g., “What is the birth date of the patient from postal code 10357 who underwent a radical prostatectomy?”), then the question is disallowed by the database owner.
Intuitively, query auditing would appear to provide an appropriate balance between privacy and utility by filtering out all queries pertaining to individual records while allowing aggregate queries to pass. In practice, however, this method has significant drawbacks. To illustrate, consider a user who issues 4 queries sequentially:
- (1) the number of patients from Iowa City, Iowa, which returns “58 patients”;
- (2) the number of US Medicare patients from Iowa City, which returns “57 patients”;
- (3) the number of patients from Iowa City who had a radical prostatectomy (ICD-9-CM procedure code 60.5), which returns “13 patients”;
- (4) the number of US Medicare patients from Iowa City who had a radical prostatectomy, which returns “12.”
All these queries are aggregate in nature and none looks particularly suspicious, as each response includes at least 12 patients. However, by combining query results, the adversary could make a dangerous discovery. Specifically, one first combines (1) and (2) to conclude there is only one patient from Iowa City who is not a Medicare patient. Then, from (3) and (4), one can infer there is one non-Medicare patient from Iowa City who had a radical prostatectomy. Combining the 2 findings, the adversary concludes that the single patient from Iowa City who is not insured by Medicare must have had a radical prostatectomy—a severe privacy disclosure for the patient.
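The 4-query example can be reproduced with a toy dataset. The records below are invented so that the counts match those in the text, and the `count` function stands in for a hypothetical aggregate-only query interface that never returns an individual record.

```python
# Hypothetical toy data constructed to reproduce the counts in the text.
patients = (
    [{"city": "Iowa City", "medicare": True, "prostatectomy": True}] * 12
    + [{"city": "Iowa City", "medicare": True, "prostatectomy": False}] * 45
    + [{"city": "Iowa City", "medicare": False, "prostatectomy": True}] * 1
)

def count(**conditions):
    """Aggregate-only interface: returns a COUNT, never a record."""
    return sum(all(p[k] == v for k, v in conditions.items()) for p in patients)

q1 = count(city="Iowa City")
q2 = count(city="Iowa City", medicare=True)
q3 = count(city="Iowa City", prostatectomy=True)
q4 = count(city="Iowa City", medicare=True, prostatectomy=True)

# Subtracting aggregates isolates a single individual.
non_medicare = q1 - q2
non_medicare_with_surgery = q3 - q4
disclosed = non_medicare == 1 and non_medicare_with_surgery == 1
print(q1, q2, q3, q4, disclosed)
```

Each query returns at least 12 patients, yet the two subtractions together show that the only non-Medicare patient had the procedure.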
The possibility of deriving confidential data about individual patients from combinations of innocent-looking aggregates (e.g., (1) to (4)) is referred to as the inference problem in data privacy. To address this problem, query auditing can be used, whereby an archive is maintained of all queries ever made by each user. Before answering a new query, the software first determines whether the user could use the new information, combined with the information from previous queries, to make an inference (i.e., as in the previous example). The query is answered only if such an inference is proven impossible. In the literature on query auditing, numerous techniques have been proposed, guarding against both exact disclosures (i.e., preventing an adversary from learning the exact value of a private attribute) (see survey13) and partial disclosures (i.e., where even learning partial information over a predetermined threshold is prevented).14 Query-auditing techniques are used extensively in anesthesia quality databases (e.g., that of the American Society of Anesthesiologists’ Anesthesia Quality Institute).
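To make the idea of auditing against a query archive concrete, the following is a deliberately simplified sketch and not one of the techniques from the surveyed literature. The class `SimpleAuditor`, its threshold `k`, and the toy records are assumptions made for illustration. The auditor refuses a COUNT query when the selected group is smaller than `k`, or when the group is nested with an earlier answered query and the two differ by fewer than `k` records, which is the subtraction used in the 4-query example.

```python
class SimpleAuditor:
    """Toy auditor for COUNT queries from a single user (illustrative only)."""

    def __init__(self, records, k=5):
        self.records = records
        self.k = k
        self.answered = []  # sets of record indices already counted

    def count(self, predicate):
        selected = frozenset(
            i for i, r in enumerate(self.records) if predicate(r)
        )
        if 0 < len(selected) < self.k:
            return None  # refused: the group itself is too small
        for earlier in self.answered:
            nested = selected <= earlier or earlier <= selected
            if nested and 0 < len(selected ^ earlier) < self.k:
                return None  # refused: subtraction would isolate a small group
        self.answered.append(selected)
        return len(selected)

# Hypothetical records, all assumed to be patients from the same city.
records = (
    [{"medicare": True, "prostatectomy": True}] * 12
    + [{"medicare": True, "prostatectomy": False}] * 45
    + [{"medicare": False, "prostatectomy": True}] * 1
)
auditor = SimpleAuditor(records, k=5)
results = [
    auditor.count(lambda r: True),
    auditor.count(lambda r: r["medicare"]),
    auditor.count(lambda r: r["prostatectomy"]),
    auditor.count(lambda r: r["medicare"] and r["prostatectomy"]),
]
print(results)
```

In this sketch, the second and fourth queries are refused because each differs from an earlier answered query by a single patient. The check covers only nested pairs of queries; it does not detect inferences that require combining three or more queries, and it does nothing to prevent collusion among users.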
While query-auditing techniques can be highly effective in controlled environments, they are probably ill-suited to scenarios involving journals publishing health care data as supplemental content. First, query-auditing techniques are interactive, not archival like a journal’s supplemental data files. Second, query auditing assumes that no 2 users can collude with each other to combine their query records. If users were to collude by pooling their respective queries, they may be able to infer confidential information that would otherwise be hidden. When access to such query filtering software can be obtained simply by clicking a hyperlink in a PDF file, there may be tens of thousands of potential users, making such collusion impossible to prevent.
b. Data Perturbation/Generalization
An alternative strategy to query auditing for data privacy is to manipulate the values of data records to be published, as shown in Figure 2. The key premise is that, while the manipulation may significantly change the value of each individual record, unbiased and precise estimates for the aggregates of interest can still be constructed.
To illustrate the concept of data perturbation, consider an example where the aggregate of interest is the mean height of all patients in a dataset. A simple perturbation can be achieved by adding to each patient’s height an independently generated random number from a Gaussian distribution with mean of 0 and SD of 50 cm. In this scenario, an individual patient’s height is substantially changed (specifically, about one-third of all patients would have their heights changed by >0.5 m, that is, by more than 1 SD of the added noise). Individuals cannot be identified from the dataset. However, as long as the sample is large enough, the sample mean height of all patients in the perturbed dataset can easily be recovered. For example, if there are 10,000 patients in the dataset, then the “bias” introduced into the data by this perturbation has a SE of $50/\sqrt{10{,}000} = 50/100 = 0.5$ cm. Using ±3 SEs as a benchmark, one concludes that the mean height computed from the perturbed dataset is almost certainly within ±1.5 cm of the original value, a negligible error for most applications that depend on knowing the mean rather than individual values.
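The height example can be simulated as a rough check. In the sketch below, the distribution of true heights (mean 170 cm, SD 10 cm) and the fixed random seed are assumptions made only so that the example is self-contained and repeatable; the noise SD of 50 cm and the sample size of 10,000 come from the text.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

n = 10_000
# Assumed population of true heights in cm (not from the source).
true_heights = [rng.gauss(170, 10) for _ in range(n)]
# Additive Gaussian noise with mean 0 and SD 50 cm, as in the text.
perturbed = [h + rng.gauss(0, 50) for h in true_heights]

mean_shift = statistics.fmean(perturbed) - statistics.fmean(true_heights)
standard_error = 50 / n ** 0.5  # 50 / sqrt(10,000) = 0.5 cm
print(round(mean_shift, 2), standard_error)
```

The shift in the mean should be on the order of the 0.5 cm SE, even though each individual height has been altered by noise with an SD of 50 cm.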
In the above example, the computation of the aggregate happens to be straightforward from the perturbed dataset (i.e., a simple mean of all perturbed heights). However, this is the exception rather than the norm. Most aggregates require complex computations to obtain unbiased and precise estimates. As an example of an aggregate that requires a more nuanced computational process, we consider the percentage of all patients in the dataset with a history of cancer. Suppose that we apply a perturbation strategy that randomly “flips” the true value from 0 to 1 or 1 to 0, indicating “cancer” or “no cancer,” with a 40% probability. A patient without cancer in the original dataset has a 40% chance of their cancer status being changed from 0 to 1. For this perturbation strategy, if the true proportion of patients with cancer is 10% in the original dataset, then the perturbed dataset will contain an expected 10% × (100% − 40%) + (100% − 10%) × 40% = (0.1 × 0.6) + (0.9 × 0.4) = 42% cancer patients, a significant departure from the true proportion of 10%. However, so long as the researcher is aware of the perturbation being applied, he or she can recover a close estimate of the original percentage. If, for example, the perturbed dataset contains 42.4% cancer patients, then the researcher simply needs to solve the following equation:
$$v \times (100\% - 40\%) + (100\% - v) \times 40\% = 42.4\%$$
to calculate v, the estimate of the original percentage. In this example, the solution is v = 12%, slightly different from the true proportion of 10%. This is likely adequate for analyses based on aggregate incidences.
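The recovery step can be written as a small function. In this sketch the function names are hypothetical; the 40% flip probability and the observed value of 42.4% come from the text. The expected observed share is v × (1 − flip) + (1 − v) × flip, and solving for v gives v = (observed − flip) / (1 − 2 × flip).

```python
def expected_observed(v, flip=0.4):
    """Expected share of 1s after each value is flipped with probability `flip`."""
    return v * (1 - flip) + (1 - v) * flip

def recover(observed, flip=0.4):
    """Invert expected_observed to estimate the original share."""
    if flip == 0.5:
        raise ValueError("a 50% flip rate destroys all information")
    return (observed - flip) / (1 - 2 * flip)

print(round(expected_observed(0.10), 3))
print(round(recover(0.424), 3))
```

A flip probability of exactly 50% is rejected because the perturbed values would then carry no information about the original share.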
Unfortunately, data perturbation techniques would be impractical for anesthesia datasets with 40 or fewer patients in each of the 2 groups, because small samples yield large SEs and wide confidence intervals for the aggregate statistics. Moreover, data generalization for small samples can lead to categories with 0 or 1 member, which greatly increases the risk of disclosure. Furthermore, the data perturbation techniques would make it impossible to evaluate whether the published data are potentially fraudulent because “perturbed” data and intentionally falsified data cannot be distinguished. This would nullify one of the major reasons for making the data available.15 Finally, when study results depend on the relationships among measurements within subjects (e.g., in pharmacokinetic/pharmacodynamic studies), analyses could not be reproduced and sensitivity of conclusions to analytical methods could not be explored.
Whereas data perturbation techniques require some knowledge of statistics, another popular strategy for protecting data is generalization. This refers to generalizing a specific value to a range of values or categories (e.g., a 27-year-old patient’s age is reported as “20–29”). Additive noise and generalization are similar, in that both aim to apply substantial changes to individual records, yet enable the accurate estimation of aggregates over many records. Nonetheless, the 2 techniques have significant differences. Data generalization is more intuitive and familiar than additive noise and is often perceived by the public as providing greater privacy protection.n Hence, this technique has often been used for state databases (e.g., when reporting patient ages and postal codes).
The most widely used and studied data generalization technique is known as k-anonymity.16 To illustrate how this works, consider a patient dataset with 2 attributes: height and weight. Every patient might originally have a different height/weight combination (e.g., Alice is 1.57 m tall and weighs 54 kg, and Bob is 1.73 m tall and weighs 64 kg). After generalization, both of their records become (1.50, 1.75 m) and (50, 70 kg), making it impossible to distinguish between the 2 records. In other words, each record is “hidden” among k − 1 others (in this example, k = 2), a privacy guarantee that is intuitive and easily understood.
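The Alice and Bob example can be sketched as follows. The bin boundaries (height in steps of 0.25 m from 1.50 m, weight in steps of 20 kg from 50 kg) are assumptions chosen to reproduce the ranges in the text, and the helper `generalize` is hypothetical. The value of k is the size of the smallest group of records that become indistinguishable after generalization.

```python
from collections import Counter

def generalize(record):
    """Map exact height (m) and weight (kg) to coarse ranges.

    Bin widths and origins are illustrative assumptions.
    """
    h_low = 1.50 + 0.25 * int((record["height_m"] - 1.50) // 0.25)
    w_low = 50 + 20 * int((record["weight_kg"] - 50) // 20)
    return ((h_low, h_low + 0.25), (w_low, w_low + 20))

records = [
    {"name": "Alice", "height_m": 1.57, "weight_kg": 54},
    {"name": "Bob", "height_m": 1.73, "weight_kg": 64},
]
groups = Counter(generalize(r) for r in records)
k = min(groups.values())
print(groups, k)
```

Both records fall into the same height and weight ranges, so the released table satisfies k-anonymity with k = 2 for these two attributes.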
Although data perturbation techniques are widely used for large databases, they may still be vulnerable to privacy threats due to attribute correlations and adversarial knowledge from external sources. The correlation between different attributes may enable an adversary to filter out added noise (or reverse the generalization) and recover the original record value (or a close approximation of it). This is called “attribute correlation.” While this might appear surprising, the underlying principle has been well understood for decades in communication theory.17
To use a simple example, a patient’s gender is suppressed, but ICD-9-CM procedure codes are unchanged. Thus, although a patient’s gender is considered “unknown,” for some patients it can be obtained from the surgical procedure (e.g., 68.5 vaginal hysterectomy). For the defender, it is extremely difficult to enumerate all possible correlations between attributes, especially when (1) there are many attributes in the dataset and (2) the correlation involves >2 attributes. For scientific journals, these correlations among attributes may not be known when the original article and its secondary data are published (i.e., the correlation may only be discovered later by future researchers).18
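The gender example can be expressed as a lookup. The two procedure codes below appear in the text (68.5 vaginal hysterectomy and 60.5 radical prostatectomy); the mapping from code to sex and the function name are illustrative assumptions, and a real attack would draw on a far larger set of correlations.

```python
# Procedure codes are from the text; the mapping is an illustrative assumption.
SEX_IMPLIED_BY_PROCEDURE = {"68.5": "F", "60.5": "M"}

def infer_sex(record):
    """Return the sex implied by any listed procedure code, or None."""
    for code in record["procedures"]:
        if code in SEX_IMPLIED_BY_PROCEDURE:
            return SEX_IMPLIED_BY_PROCEDURE[code]
    return None

# Hypothetical released record in which sex has been suppressed.
released = {"sex": None, "procedures": ["68.5"]}
print(infer_sex(released))
```

Suppressing one attribute offers little protection when another published attribute determines it.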
As discussed in Sections 2 and 3, another threat to privacy comes from certain knowledge an adversary may acquire from sources other than the published (perturbed) dataset, knowledge referred to as external knowledge. For example, suppose a Senator is 2 m tall, and his height has been mentioned in numerous articles. Given this information, an adversary could easily determine whether the Senator was included in the sample. His inclusion would correspond to a sample with an upper height limit of exactly 2 m; if the upper limit is <2 m, then the Senator is not included in the sample. This inference is possible because the upper bound would typically correspond to the height of the tallest person, since any privacy-preserving technique would seek to reduce the applied perturbation to derive maximum utility from the perturbed dataset. For our focus on journals providing data, these types of attacks may be difficult to defend against. The authors and editors would need to know the types of external knowledge to which a future adversary will have access.
While the methods available to those who would undermine privacy have undergone rapid development, the methods of “defense” have not achieved similar breakthroughs. In spite of the technological progress in the field of information privacy, the task of defending such databases has not gotten easier, as these advances have often led to new threats. In the next section, we consider the relevant legal frameworks that pertain to health data privacy. We illustrate the inherent limitations of such approaches, for example, the HIPAA safe harbor method.
5. HIPAA, SAFE HARBOR, LEGAL ISSUES, AND IRB APPROVAL
As stated in Section 2, the most widely used method in the United States for ensuring health data privacy comes from the Safe Harbor Provision of the HIPAA privacy rule.o Other examples of such privacy frameworks include Ontario, Canada’s “Personal Information Protection and Electronic Documents Act” and the European Union’s “Directive on Data Protection.”p However, the HIPAA law was passed in 1996 and is based on an outdated and oversimplified conception of information privacy. This was out of necessity, because more nuanced or complex methods would have been difficult for every health care provider to adopt. HIPAA provides an alternative to “safe harbor” known as the “statistical standard.”19 While applying the statistical standard requires technical expertise, it also may offer stronger privacy protections.
According to HIPAA, the methodology for protecting patient privacy is not restricted to those attributes that are contained within the given database. Rather, the method must also consider how the data could be used in combination with “other reasonably available information” to re-identify an individual. This is what we did in Section 2 in our example of the dinner party. Another example would be an adversary who obtains a coworker’s postal code from an employee directory or her date of birth via a company-wide e-mail of “birthday announcements.” Thus, by extension, a clinical journal’s responsibility for protecting patient privacy is not limited to those attributes published in or with its articles. The journal should also consider how this information could be combined with other reasonably available information, such as what might be found in newspaper articles and public databases, as shown by the Sweeney study in Table 1. However, to clarify, journals are not “covered entities” as defined by HIPAA and therefore do not have to comply with the safe harbor protocols. For example, they are not required to publish the patient’s age in “years” instead of “months” or to redact dates of surgery, although both are classified as PHI. There is an ethical responsibility, but not a legal one.
In theory, the HIPAA law was intended to protect the privacy of all patient data. In practice, however, privacy protections are significantly weaker for health data that contain no PHI. For example, suppose that an unencrypted computer is stolen from a hospital office. While this computer contains unprotected health information, it does not contain any PHI. Using only non-PHI, the adversary is able to discover the identities of several patients and posts their medical histories on a Web site, along with their names and addresses. Although the adversary could be sued for violating HIPAA, the case against the hospital would be harder to prove. Even though the theft of the hospital’s computer led to the privacy breach, the hospital could argue that, because no PHI was involved, the missing data were “HIPAA-compliant.” Moreover, under the “Breach Notification Rule,” the hospital would not be required to report the incident.q This would give the hospital substantial protection against potential lawsuits from dissatisfied patients.
Compliance with the safe harbor standard is often considered a “good enough” method of privacy protection, at least from the perspective of an organization’s legal obligations and liabilities. In this manner, the societal “privacy problem” is transformed into an organizational “compliance problem.” A computer scientist who warns that our faith in anonymization is misplaced may be viewed with skepticism by health care executives, who will likely respond that all of their databases are HIPAA-compliant.
At the institutional level, IRBs also have a role to play in ensuring that the privacy rights of study subjects are not violated. For example, the IRB could prohibit the researchers from including their data, even though required by a journal, unless the researchers could show that this would pose a “minimal risk” to patients (e.g., the <4 in 10,000 chance of re-identification described in Section 2). The problem is that, at least in the United States, such a standard would practically never be satisfied for small anesthesia studies, because some relevant data are publicly available and discharge abstract data are available for purchase.
Returning to the example of the NIH data-sharing plan, a researcher could demonstrate that all “identifiers” have been removed in accordance with HIPAA. As in the hospital example, this would limit the researcher’s liability in the event of a privacy breach. The NIH does not specify additional privacy requirements, beyond those that may be required by state and Federal laws, as well as IRBs. To the extent that these data-sharing plans rely on HIPAA for privacy protection, they may also be vulnerable to re-identification.
6. CASE STUDY USING TEXAS INPATIENT DATABASE
Suppose that an adversary had access to a state database of hospital inpatient discharges that contained multiple procedure codes containing sensitive medical information. How difficult would it be for this adversary to match the data in anesthesia records to an external database? We addressed this question by using the Texas Inpatient Public Use Data File for 2013 from the Texas Department of State Health Services. The database includes >2.8 million records (rows) and 255 distinct attributes (columns), including up to 24 procedure codes. The first step in the process was to select some attributes of the state database that overlap with data commonly presented in anesthesia case series or small clinical trials (i.e., available in journal articles’ secondary data sets). Useful information about the anesthetics would typically include at least the patient sex and surgical procedures (see above). The hospital name can be inferred from the author’s affiliation. (Again, we are considering a case series or small trial, so typically this would be one hospital.) The quarter of the year could be inferred from when the study was performed (e.g., “January and February 2015”). These overlapping attributes are displayed in Figure 3.
All patients were included whose primary procedure indicated a surgical procedure, “narrowly defined” (n = 836,923). The “narrow” definition is from the Agency for Healthcare Research and Quality’s Healthcare Cost and Utilization Project’s Surgery Flag Software.r These are the major surgical procedures (e.g., thoracotomy). We calculated the percentage of patients who are uniquely identified from the combination of hospital, sex, quarter, and procedure code(s).
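The uniqueness calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the field names (hospital, sex, quarter, procedure_codes), the function names, and the toy records are all hypothetical, and the actual Texas file uses its own column layout.

```python
from collections import Counter


def record_key(record):
    """Build the attribute combination an adversary could match on."""
    # Sort the procedure codes so their order within a record does not matter.
    return (
        record["hospital"],
        record["sex"],
        record["quarter"],
        tuple(sorted(record["procedure_codes"])),
    )


def percent_unique(records):
    """Percentage of records whose attribute combination occurs exactly once."""
    if not records:
        return 0.0
    counts = Counter(record_key(r) for r in records)
    unique = sum(1 for r in records if counts[record_key(r)] == 1)
    return 100.0 * unique / len(records)


def percent_unique_by_procedure_count(records):
    """Percent uniqueness stratified by the number of procedure codes.

    Uniqueness is judged against the whole dataset, then summarized per stratum.
    """
    counts = Counter(record_key(r) for r in records)
    totals, uniques = Counter(), Counter()
    for r in records:
        n = len(r["procedure_codes"])
        totals[n] += 1
        if counts[record_key(r)] == 1:
            uniques[n] += 1
    return {n: 100.0 * uniques[n] / totals[n] for n in totals}


# Toy, made-up records (hypothetical; not taken from the Texas file).
toy = [
    {"hospital": "A", "sex": "F", "quarter": 1, "procedure_codes": ["81.54"]},
    {"hospital": "A", "sex": "F", "quarter": 1, "procedure_codes": ["81.54"]},
    {"hospital": "A", "sex": "M", "quarter": 1, "procedure_codes": ["81.54"]},
    {"hospital": "B", "sex": "F", "quarter": 2, "procedure_codes": ["36.12", "39.61"]},
]
print(percent_unique(toy))  # 50.0: two of the four toy records are unique
print(percent_unique_by_procedure_count(toy))
```

In this toy example, the two identical records for hospital A cannot be told apart, whereas the single record with 2 procedure codes is unique.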
As shown in Table 3, patients who underwent only one procedure, who made up 59% of all patients in the sample, had a uniqueness of 16.3%. However, the percent uniqueness increased to 64% for patients who underwent 2 procedures during their hospitalization. For patients undergoing 3 or more procedures, the percent uniqueness was 80% or greater. Note that these are procedures, not anesthetics (cases) (i.e., typically this would still be just one anesthetic [case]).6,8 For a patient selected at random from this population, the percent uniqueness was 42.8% (SE < 0.1%). Thus, an adversary would have about a 42.8% chance of linking the anesthesia record to the hospital database, and thereby discovering the patient’s sensitive information. This is just from a public database released by the state. We did not consider other sources of information to which the adversary would have access (e.g., Google search of newspaper stories, Twitter, and other social media Web sites). In practice, the probability that an adversary could match a patient’s record to external databases would be even greater than the 42.8%. This would seem to represent an unacceptably high level of risk.
Moreover, Texas is the second largest state in the United States. Consider, for example, Iowa, whose population is roughly one-eighth that of Texas. If we were to repeat the above analysis using data from the state of Iowa, the percent uniqueness would be significantly greater. Specifically, El Emam performed a risk analysis for all 50 states and found that the risk of exposure was >4 times higher for Iowa than for Texas.3 The 2 states with the greatest risks were Wyoming and North Dakota. This demonstrates that smaller databases entail greater risks for individual patients.
Note that an adversary can purchase the state database legally and then attempt to match its records to published anesthesia records, using the overlapping attributes in Figure 3. The process does not involve “hacking” (i.e., gaining unauthorized access to sensitive information). The adversary does not require access to confidential data because all of the relevant data were either in the public domain or available for purchase.
Coding systems are periodically revised to reflect innovations in surgical techniques and to include greater specificity (e.g., right versus left), and with each revision, the number of categories inevitably expands. As of October 1, 2015, hospitals in the United States were required to make the transition from ICD-9-CM to ICD-10-CM. Consequently, the number of procedure codes increased from about 3800 to >71,000. Hence, the percent uniqueness, as defined above for Texas, would be significantly greater. Other medical coding systems (e.g., SNOMED) have more than one million categories, and with genomic data, the level of complexity is even greater.s
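The direction of this effect can be illustrated with a deliberately simplified model. The sketch below assumes, hypothetically, that records fall independently and uniformly at random into equally likely attribute combinations ("cells"); real procedure frequencies are highly skewed, so the numbers are illustrative only and are not estimates for Texas or any other state. Under that assumption, a given record is unique with probability (1 - 1/k)^(n - 1), where n is the number of records and k is the number of cells. The function name and the example values of n and k are ours.

```python
def expected_percent_unique(n_records, n_cells):
    """Expected percent of unique records under a uniform, independent model.

    Assumption (hypothetical): each of n_records falls independently into one
    of n_cells equally likely attribute combinations. A record is unique when
    none of the other n_records - 1 records lands in its cell.
    """
    if n_records < 1 or n_cells < 1:
        raise ValueError("n_records and n_cells must be at least 1")
    return 100.0 * (1.0 - 1.0 / n_cells) ** (n_records - 1)


# Illustrative values only: 1000 records compared under a coarser and a finer
# coding system, then a smaller population under the coarser system.
coarse = expected_percent_unique(n_records=1000, n_cells=3800)
fine = expected_percent_unique(n_records=1000, n_cells=71000)
small_population = expected_percent_unique(n_records=125, n_cells=3800)
print(f"coarser codes: {coarse:.1f}%  finer codes: {fine:.1f}%")
print(f"smaller population, coarser codes: {small_population:.1f}%")
```

In this simplified model, uniqueness rises both when the number of categories increases and when the number of records decreases, which is the qualitative pattern described above for finer coding systems and for smaller states.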
7. DISCUSSION AND CONCLUSIONS
The purpose of this article was to evaluate the risks to patient privacy from including small datasets from anesthesia studies as secondary digital content. We surveyed examples of successful privacy attacks and the latest methods from computer science to protect against them. Using the State of Texas database, we showed that there is a 42.8% chance that an adversary could match an anesthesia record to a public database. The percentage is greater for patients undergoing multiple procedures, from smaller states, and for other procedure classification systems such as ICD-10-CM.
As the literature on this topic is voluminous and changing, this review article could only provide an overview of the most salient methods and current controversies.20 However, we think that the preceding pages are sufficient for a few essential takeaway messages. First, the task of protecting sensitive health information is far more challenging and complex than simple compliance with a known standard (e.g., safe harbor). Second, the editorial policies that have been adopted by research journals in other fields (e.g., economics and management science) may not be appropriate for clinical journals. This is partly for technical reasons, such as the sparsity of health care databases. It is also due to the sacrosanct nature of medical data itself and the potential loss of trust that would occur if such data were re-identified. Third, a small dataset does not imply a small risk of disclosure, especially if an adversary knows someone who participated in the study; rather, it is the opposite.
Few people today would think that the combination of hospital and surgical procedures could be sufficient to match data from a small, observational study to a single inpatient record out of a database of millions. After all, neither hospital name nor surgical procedures are PHI. Hence, the use or exchange of these data is largely unregulated. Although advanced methods of privacy protection exist (e.g., data perturbation and query auditing), as we have reviewed, these methods require technical expertise and are better suited for large databases.
We also recognize that anesthesia journals have a responsibility (a) to archive articles and their supplementary digital content; (b) to prevent scientific fraud; and (c) to ensure the validity of their scientific findings (see also footnote b on page 1).t,u Having access to secondary digital content may allow other researchers to replicate the original findings to establish “consistency.”v,21 Investigators often provide a rudimentary statistical analysis and state that this is sufficient, while excluding the details that would be needed to replicate their findings. Providing the data as supplemental content facilitates replication and assessment of the robustness of conclusions to different statistical methods and assumptions. Providing the data may also help other researchers to design their own experiments that test the original findings. However, making data available to a virtually unlimited number of future adversaries with originally unknown external data sources and knowledge makes the prospect of privacy protection highly unlikely.
As a reasonable compromise, we propose that anesthesia journals’ supplemental content routinely include a single record of a “representative” (hypothetical) patient, which combines the salient attributes of 3 or more patients in the study. An important feature of this representative patient would be the precisely defined data structure (format), also known as the data schema or meta-data. Publication of the article would be subject not only to (the current) affirmation of who is the archiving author but also to that author’s affirmation that he or she will maintain all the data in the specified data structure (format) and provide it upon request by the Editor-in-Chief for purposes of evaluating the replicability of the published study. In addition, the authors would affirm that they will make the data, in that structure, available to others, provided the requesting investigator(s) obtain approval from the relevant IRBs. Our recommended policy would serve to protect the journal from the risks of making the clinical data available as secondary digital content (i.e., publishing the data), while implementing procedural safeguards to ensure transparency and protect against scientific fraud.
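One possible reading of this proposal is sketched below in Python. Everything in the sketch is an assumption made for illustration: the field names, the choice of the most common value for categorical attributes and the median for numeric ones, and the function name representative_record are hypothetical, and a journal would need to define its own schema and combination rules. The sketch does not check whether the composite happens to coincide with a real patient's record, which an actual policy would need to consider.

```python
from dataclasses import dataclass, fields
from statistics import median, mode


@dataclass(frozen=True)
class PatientRecord:
    """Hypothetical data schema shared by the study data and the composite."""
    sex: str
    procedure_codes: tuple
    case_duration_min: float


def representative_record(patients, min_patients=3):
    """Combine the attributes of at least min_patients patients into one record.

    Assumed rule: most common value for categorical fields, median for
    numeric fields.
    """
    if len(patients) < min_patients:
        raise ValueError("need at least %d patients" % min_patients)
    return PatientRecord(
        sex=mode([p.sex for p in patients]),
        procedure_codes=mode([p.procedure_codes for p in patients]),
        case_duration_min=median([p.case_duration_min for p in patients]),
    )


# Made-up patients, for illustration only.
patients = [
    PatientRecord("F", ("81.54",), 95.0),
    PatientRecord("F", ("81.54",), 120.0),
    PatientRecord("M", ("81.54",), 80.0),
]
composite = representative_record(patients)
schema = [(f.name, f.type) for f in fields(PatientRecord)]
print(composite)
print(schema)
```

The schema list is the part that would be published with the representative record, so that a requesting investigator knows the exact structure in which the archived data are held.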
Why The Hospital and Procedure(s) Cannot Be Omitted
The hospital and procedure(s) cannot be omitted if the anesthesia study data are to be used by other investigators either to replicate the authors’ analyses or to motivate future research.22,23 Making data available from a journal is designed in part to ensure that investigators can evaluate covariates. Heterogeneity in the hospital and specific procedure(s) influences anesthesia workflow.22 The most influential variable for selection of the hospital where a case is performed is the procedure(s).24 Conceptually, this is as simple as heart transplantation not being performed at rural hospitals with 2 operating rooms open daily.13 The most influential variables for surgical case duration and its coefficient of variation are the hospital and the procedure(s).22,25 Assessments of perioperative morbidity control for the hospital, and, within a given hospital, the most influential variable predicting morbidity is the procedure.26,27 Finally, if the dependent variable is continuous, the sample size is large, and linear regression is used, then omitting an independent variable (e.g., procedure) does not change the other independent variables’ estimated coefficients or SEs. However, this does not hold for models with nonlinear link functions such as logistic regression or survival analysis.23 Consequently, omitting either the hospital or the procedure would make it infeasible to replicate the authors’ work when the dependent variable is binary and logistic regression is used or when the dependent variable is time to event (survival) and a Cox proportional hazards model is used.23 At a minimum, making the data available facilitates replication and may support testing the reproducibility (robustness) of the findings, which represents a higher standard of scientific validation than simple replication (see Table 2).21
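The contrast between linear and logistic regression when a covariate is omitted can be checked with a small exact calculation. The sketch below is an illustration under assumptions that are ours, not the authors': two binary covariates, x (retained) and z (omitted, e.g., a procedure indicator), with z independent of x so that omitting z introduces no confounding; the coefficient values are arbitrary. It compares the effect of x conditional on z with the effect obtained after averaging over z.

```python
import math


def expit(t):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))


def effect_of_x_after_omitting_z(b0, b_x, b_z, p_z=0.5):
    """Return (linear_effect, logistic_effect) of binary x once z is omitted.

    Assumption: z is binary with P(z = 1) = p_z and is independent of x.
    The effect of x conditional on z equals b_x in both models.
    """
    weights = ((0, 1.0 - p_z), (1, p_z))

    def averaged(link, x):
        # Average the outcome over the distribution of the omitted covariate z.
        return sum(w * link(b0 + b_x * x + b_z * z) for z, w in weights)

    # Identity link (linear regression): difference in mean outcome.
    linear_effect = averaged(lambda t: t, 1) - averaged(lambda t: t, 0)
    # Logit link: difference in log odds computed from the averaged probabilities.
    logistic_effect = logit(averaged(expit, 1)) - logit(averaged(expit, 0))
    return linear_effect, logistic_effect


linear_effect, logistic_effect = effect_of_x_after_omitting_z(b0=-2.0, b_x=1.0, b_z=3.0)
print(f"linear:   conditional 1.000, after omitting z {linear_effect:.3f}")
print(f"logistic: conditional 1.000, after omitting z {logistic_effect:.3f}")
```

With the identity link, the effect of x is unchanged after z is dropped. With the logit link, the effect of x shrinks toward zero even though z is independent of x, so a reanalysis without the omitted variable would not reproduce the published coefficient.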
Name: Liam O’Neill, PhD.
Contribution: This author helped design the study, conduct the study, analyze the data, and write the manuscript.
Attestation: Liam O’Neill has approved the final manuscript.
Name: Franklin Dexter, MD, PhD.
Contribution: This author helped design the study, conduct the study, and write the manuscript.
Attestation: Franklin Dexter has approved the final manuscript.
Name: Nan Zhang, PhD.
Contribution: This author helped to analyze the data and write the manuscript.
Attestation: Nan Zhang has approved the final manuscript.
Tong Yan assisted with computer programming.
Dr. Franklin Dexter is the Statistical Editor and the Section Editor for Economics, Education, and Policy for Anesthesia & Analgesia. This manuscript was handled by Dr. Steven Shafer, Editor-in-Chief, and Dr. Dexter was not involved in any way with the editorial process or decision.