Strategies for De-identification and Anonymization of Electronic Health Record Data for Use in Multicenter Research Studies
Kushida, Clete A. MD, PhD*; Nichols, Deborah A. MS*; Jadrnicek, Rik†; Miller, Ric‡; Walsh, James K. PhD§; Griffin, Kara MA§
*Stanford Sleep Medicine Center, Redwood City, CA
†Microflow DBMS Inc., Aalapapa Drive, Kailua, HI
‡Microflow DBMS Inc., Sausalito, CA
§Sleep Medicine and Research Center, Chesterfield, MO
Supported by a contract from AcademyHealth. Additional funding was provided by AHRQ 1R01HS1973 [Comparative Outcomes Management with Electronic Data Technology (COMET) Study].
The authors declare no conflict of interest.
Reprints: Clete A. Kushida, MD, PhD, Stanford Sleep Medicine Center, 450 Broadway Street, MC 5704, Pavilion C, 2nd Floor, Redwood City, CA 94063-5704. E-mail: email@example.com.
Background: De-identification and anonymization are strategies that are used to remove patient identifiers in electronic health record data. The use of these strategies in multicenter research studies is of paramount importance, given the need to share electronic health record data across multiple environments and institutions while safeguarding patient privacy.
Methods: Systematic literature search using keywords of de-identify, deidentify, de-identification, deidentification, anonymize, anonymization, data scrubbing, and text scrubbing. Search was conducted up to June 30, 2011 and involved 6 different common literature databases. A total of 1798 prospective citations were identified, and 94 full-text articles met the criteria for review and the corresponding articles were obtained. Search results were supplemented by review of 26 additional full-text articles; a total of 120 full-text articles were reviewed.
Results: A final sample of 45 articles met inclusion criteria for review and discussion. Articles were grouped into text, images, and biological sample categories. For text-based strategies, the approaches were segregated into heuristic, lexical, and pattern-based systems versus statistical learning-based systems. For images, approaches that de-identified photographic facial images and magnetic resonance image data were described. For biological samples, approaches that managed the identifiers linked with these samples were discussed, particularly with respect to meeting the anonymization requirements needed for Institutional Review Board exemption under the Common Rule.
Conclusions: Current de-identification strategies have their limitations, and statistical learning-based systems have distinct advantages over other approaches for the de-identification of free text. True anonymization is challenging, and further work is needed in the areas of de-identification of datasets and protection of genetic information.
The use of medical records and human tissues in biomedical research in the United States is covered under the Standards for Privacy of Individually Identifiable Health Information (usually referred to as the Privacy Rule) and the Common Rule. In response to a congressional mandate in the Health Insurance Portability and Accountability Act of 1996 (HIPAA), the Department of Health and Human Services (HHS) issued the HIPAA Privacy Rule regulations in December 2000. The Privacy Rule permits covered entities (ie, health plans, health care clearinghouses, or health care providers who transmit health information in electronic form in connection with a transaction for which HHS has adopted standards) to use and disclose data from which patient identifiers have been removed without obtaining an authorization and without further restrictions on use or disclosure, because data stripped of these identifiers are no longer protected health information (PHI) and, therefore, are not subject to the Privacy Rule. There are 18 “safe harbor” data identifiers under the Privacy Rule that constitute the minimal set of identifiers that must be removed. The Common Rule sets the basic principles for protecting patients from research risks, using human tissues in support of medical research, and guiding the activities of Institutional Review Boards (IRB).
The use of data stripped of patient identifiers is one of 3 current options available to investigators desiring to use medical data in research, besides obtaining informed consent from their patients or a waiver of informed consent from their IRB. The process by which a data custodian prepares, manages, and distributes a dataset that does not contain individually identifiable information to a data recipient is referred to as de-identification or anonymization (Table 1). De-identification of medical record data refers to the removal or replacement of personal identifiers so that it would be difficult to reestablish a link between the individual and his or her data. Although a de-identified dataset may contain an encrypted patient identifier with which authorized individuals could relink a patient with his or her dataset, this dataset must not contain data that will allow an unauthorized individual to infer a patient’s identity from the existing data elements. Anonymization refers to the irreversible removal of the link between the individual and his or her medical record data to the degree that it would be virtually impossible to reestablish the link. With IRB approval, an individual can be reidentified from a de-identified record, but this is not the case for an anonymized record. However, it is of concern that de-identified and even anonymized patient datasets could potentially be reidentified. For example, based on experiments using 1990 US Census summary data, Sweeney1 reports that 87% of the population in the United States reported characteristics that likely made them unique based only on 5-digit zip code, sex, and date of birth.
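The reidentification concern can be made concrete with a small calculation. The following is a minimal sketch, not Sweeney's method or data: it assumes a toy table of invented records and a hypothetical helper, fraction_unique, that reports the share of records whose combination of 5-digit zip code, sex, and date of birth occurs only once.

```python
from collections import Counter

# Toy records: (5-digit zip code, sex, date of birth). All values are invented.
records = [
    ("94063", "F", "1970-03-14"),
    ("94063", "F", "1970-03-14"),
    ("94063", "M", "1982-11-02"),
    ("63017", "F", "1955-07-30"),
    ("63017", "M", "1955-07-30"),
]


def fraction_unique(rows):
    """Share of rows whose quasi-identifier combination occurs exactly once."""
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


print(fraction_unique(records))  # 0.6
```

In this toy table, 3 of the 5 records are unique on the 3 fields, so removing names alone would not prevent a recipient who holds an outside list with the same fields from relinking them.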
As the use of electronic health records (EHRs) has progressively increased, concerns have been raised about their utility to fundamentally improve the quality of patient care and the threat of unauthorized disclosure of PHI either unintentionally or by identity theft. In addition, biomedical research is becoming increasingly dependent on the access, sharing, and management of EHR among clinical and research centers, especially those involved in observational and multicenter research studies, in particular comparative effectiveness research (CER). The American Recovery and Reinvestment Act of 2009 included $1.1 billion for CER to support research assessing the comparative effectiveness of health care treatments and strategies. As observed by others, with this influx of support for CER combined with the digitization of medical and biological information, we appear to be closer than ever to providing health care’s stakeholders with much needed evidence of the comparative effectiveness of treatments. The use of de-identification and anonymization strategies in multicenter research studies, and especially CER, is critically important, as this type of research typically involves large-scale environments encompassing multiple institutions, and these strategies provide a scalable way of sharing medical information in these environments while also protecting patient privacy. Below we review published strategies and techniques specifically developed for de-identification and anonymization of EHRs and comment on their strengths and limitations.
Database Sources for Literature Search
* BIOSIS Previews [through Thomson Reuters Institute for Scientific Information (ISI) Web of Knowledge, 1926–present]
* Cumulative Index to Nursing and Allied Health Literature, by EBSCOhost (CINAHL, 1937–present)
* Inspec (by Thomson Reuters ISI Web of Knowledge, 1898–present)
* Medical Literature Analysis and Retrieval System Online (MEDLINE, 1950–present)
* SciVerse Scopus (1823–present)
* Web of Science (by Thomson Reuters ISI Web of Knowledge, 1898–present)
The following keywords were used to identify prospective articles: de-identify, deidentify, de-identification, deidentification, anonymize, anonymization, data scrubbing, and text scrubbing (Fig. 1). In 3 of the databases (BIOSIS Previews, Inspec, Web of Science), a wildcard (eg, *, $) was substituted for the hyphen in “de-identify” and “de-identification” to allow the database search tool to correctly recognize these keywords.
Search Criteria and Strategy
Articles were included if they were published up to June 30, 2011 and there was no restriction on earliest date of publication (ie, earliest date obtained in search was 1996). Through the combined database search, 1798 prospective citations were identified (Fig. 1). Duplicate citations, citations that were not relevant to the topic, and citations for nonrelevant article types (eg, reviews, opinions, editorials, or commentaries) were excluded. Abstracts of citations that appeared relevant were reviewed, and those that were either not relevant to the topic or outside of the medical records domain were excluded. In addition, abstracts were excluded if the corresponding article was written in a language other than English, but there was no exclusion for research conducted in other countries and it should be recognized that their strategies do not necessarily need to conform to HIPAA and other US rules.
A total of 94 full-text articles met the criteria for review and the corresponding articles were obtained; search results were supplemented by review of 26 additional full-text articles through extraction of relevant citations from the references of articles obtained through the searches. This produced a total of 120 full-text articles reviewed; 75 articles were excluded as they were not relevant to the topic of de-identification or anonymization strategies or because the given strategy lacked sufficient detail to understand or interpret it. Articles that focused specifically on anonymization strategies were included only if they had been tested on a medical population; articles encompassing anonymization models (eg, k-anonymity), primarily based on generalization methods, were not included in this document. The chair of the writing group (C.K.) conducted the above process; however, 5 other members of the writing group (J.W., R.J., R.M., D.N., K.G.) independently reviewed the 120 full-text articles obtained after the abstracts review. Differences between the reviewers’ judgments regarding inclusion or exclusion of articles were resolved by discussion; consensus was required from all 6 reviewers. Thus, the full text of 120 articles was reviewed and resulted in a final sample of 45 articles that met inclusion criteria for review and discussion in this document.
Manual de-identification of PHI from free text in EHR can be tedious, costly, time-consuming, inaccurate, and unreliable. Neamatullah et al2 reported that resident clinicians could de-identify at a rate of about 18,000 words or 90 incidents of PHI per hour. Automated approaches to de-identify free text in EHR have been developed, and there are various ways of classifying these approaches. One method is to segregate heuristic, lexical, and pattern-based systems versus statistical learning-based systems. The heuristic, lexical, and pattern-based systems rely on matching precompiled, manually constructed sets of word lists, dictionaries, regular expressions, and heuristics to remove occurrences found in the free text portions. Statistical learning-based systems typically use an annotated training set of examples to search for a statistical pattern of specific features, learning how to identify PHI from the data itself.
An example of a heuristic, lexical, and pattern-based system is deid, which is an automated Perl-based de-identification software package that uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI and an extended PHI dataset that includes doctors’ names and years extracted from dates (Table 2).2 The de-identification process involves scanning the free-text EHR (eg, nursing notes, discharge summaries, x-ray reports) line-by-line, and dividing them into individual words; PHI is then identified using dictionary-based look-ups and pattern-matching using regular expressions. Simple heuristics qualify or reject ambiguous terms as PHI. The final step in the process is the replacement of each instance of PHI with a tag to indicate its corresponding category. This system had equal or better performance than manual de-identification, with an average recall performance of 96.7% on a gold-standard corpus of reidentified nursing notes. This system has also been adapted for regional use in Sweden,30,33 France,27 and Ontario, Canada.34 In addition, other heuristic, lexical, and pattern-based systems use strategies such as algorithms employing rules, dictionaries, look-up tables, word list-matching, and pattern-matching either singly or in combination5,7–13,15,16,24,32,45; semantic category recognition approaches20; numerous detection algorithms competing in parallel to recognize a specific entity (eg, fields such as first name, last name, street address, date)3; natural language processing tools to identify and remove PHI4,28,29; and disambiguating clinical text containing jargon and acronyms using rules that analyze surrounding words followed by data scrubbing.19
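The general shape of such a pipeline can be sketched in a few lines. This is an illustration only and is not the deid package (which is written in Perl): the word lists, the 2 regular expressions, the cue-word heuristic, and the tag format are all assumptions chosen for the example.

```python
import re

# Hypothetical look-up tables; a real system would load far larger lists.
KNOWN_NAMES = {"smith", "jones", "black"}
NAME_CUES = {"mr", "mrs", "ms", "dr"}

# Hypothetical patterns for two of the more standardized PHI types.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]
PUNCT = ".,;:"


def scrub_line(line):
    """Replace PHI found in one line of free text with category tags."""
    # Step 1: pattern matching with regular expressions.
    for tag, pattern in PATTERNS:
        line = pattern.sub(f"[**{tag}**]", line)
    # Step 2: dictionary look-up on individual words, qualified by a simple
    # heuristic: a listed word counts as a name only after a cue word.
    words = line.split()
    scrubbed = []
    for i, word in enumerate(words):
        core = word.rstrip(PUNCT)
        trailing = word[len(core):]
        previous = words[i - 1].rstrip(PUNCT).lower() if i > 0 else ""
        if core.lower() in KNOWN_NAMES and previous in NAME_CUES:
            scrubbed.append("[**NAME**]" + trailing)
        else:
            scrubbed.append(word)
    return " ".join(scrubbed)


print(scrub_line("Dr. Black saw her on 3/14/2010; stool was black."))
# Dr. [**NAME**] saw her on [**DATE**]; stool was black.
```

The cue-word rule tags "Black" as a name after "Dr." while leaving the adjective "black" untouched; the same rule would miss a surname that appears without a cue word, which is the kind of trade-off hand-built heuristics force.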
By contrast, an example of a statistical learning-based system is Stat De-id, which uses support vector machines (SVMs) to identify the class of individual words as one of 8 categories (doctor, location, phone, address, patient, identification, hospital, or non-PHI), and uses features of the target, as well as features of surrounding words to capture contextual relationships.26 Syntactic information extracted from the Link Grammar Parser47 was used to create a representation of local context, which aids de-identification despite PHI including out-of-vocabulary words, and even when PHI are ambiguous with non-PHI within the same corpus. Semantic information from medical dictionaries was used to complement the syntactic information. Stat De-id was evaluated using 889 discharge summaries, and the system achieved an F-measure of 98% with a precision of 99% and a recall of 97% (average of all PHI types). Other statistical learning-based systems contain unique features such as an algorithm that estimates the probability that an assigned candidate patient name reference satisfies a set of semantic selection restrictions6; an SVM that was trained with syntactic and lexical contextual clues, as well as ontological information from Unified Medical Language System21; a Named Entity Recognition system based on Conditional Random Fields (ie, discriminative probabilistic model that is shown to be effective in labeling natural language) for extracting identifying and sensitive attributes25; and a toolkit that combines a web-based graphical annotation tool with modules for document processing, redaction and resynthesis, and engine, and evaluation of the performance of the PHI classifier.31,35
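The reported Stat De-id figures are mutually consistent if the F-measure is taken to be the balanced form (the harmonic mean of precision and recall); that choice is an assumption here, as the weighting is not stated above. A short check:

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision of 99% and recall of 97%, as reported for Stat De-id.
print(round(f_measure(0.99, 0.97), 2))  # 0.98
```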
The American Medical Informatics Association sponsored an automated de-identification challenge in 2007 as part of the i2b2 (Informatics for Integrating Biology and the Bedside) project.48 A corpus of medical discharge summaries (training set of 669 reports and test set of 220 reports) was used by 7 teams of investigators using heuristic and/or statistical de-identification strategies. Statistical learning systems using rule templates as features performed best, followed by hybrid systems of rules then machine learning; the best-performing de-identification system used 2 existing toolkits (Carafe and LingPipe) for Named Entity Recognition, complemented with regular expressions that could capture the more standardized PHI.23 The 7 systems had recall rates of 80%–96% with specificities of 83%–97%.48
HIPAA guidelines require that for data to qualify as sharable under the Privacy Rule “safe harbor” regulations, one of the identifiers that must be removed is “full facial photographic images and any comparable images.” However, there is a need for maxillofacial surgeons and other clinicians to view and share these images to improve treatment and for educational purposes. To address this challenge, a method that uses blended facial composites to de-identify photographic facial images was developed.42 These composites were created from a library of frontal facial images of white men obtained from a secure image repository. At least 2 original facial images were selected, based on a subjective analysis of facial characteristics and an expected best fit, and these images were combined into a single facial composite in 10–15 minutes. Participants strongly agreed (83%) or agreed (17%) that the composites were clinically realistic patient images, and 83% rated the composites as more effective at de-identification than traditional methods. However, there is a concern regarding the degree to which the composite alters the original image thereby decreasing its realism and clinical usefulness.
Aside from full-face images, facial features are also important when recognizing a familiar individual, so researchers have the added complication of removing identifying facial features from morphometric scans because of concerns that these images can be used to identify individuals. The significance of this problem is exemplified by 1 study that used 3-dimensional volumetric reconstructions of magnetic resonance (MR) image data, in which participants were requested to pair these reconstructions with one of 40 photographs; 40% of the participants were able to successfully match photographs with MR images.49 The need to de-identify facial features, however, must be balanced against methods that affect either the quality of the image or the removal or distortion of brain tissue and/or other anatomic features under study. Automated skull-stripping algorithms (ie, for removal of noncerebral tissue in T1-weighted MR brain images) are commercially available that could be used for de-identification purposes, but subject population and scanner performance during data acquisition may degrade the image quality of the region of interest and prove unreliable for multicenter studies that require large-scale, automated de-identified processes for image sharing.36 A newer approach uses an automated “defacing” algorithm that uses models of nonbrain structures for removing only identifiable facial features from MR volumes. This defacing algorithm did an effective job of removing facial features without sacrificing brain tissue, could be performed relatively quickly (approximately 25 min on a dataset of 342), did not interfere with subsequent data processing, and in some cases, improved the quality of subsequent automated skull-stripping by removing more nonbrain tissue.
Besides full-face de-identification, skull-stripping, and defacing algorithms, other investigators have developed approaches for Digital Imaging and Communications in Medicine (DICOM) data, which involve either substituting a new de-identified DICOM header39; altering PHI instances (eg, removing PHI by setting the value to null or emptying their content, substituting dummy or new, de-identified values)37,40; using reversible anonymization by encryption and early removal of identifying data37; or using threshold detection in an algorithm to calculate region variances of intensity values for each pixel in the image to separate PHI and annotations from background and anatomic structures.41 Unfortunately, quantitative data on the performance of these algorithms were not reported in these studies, so it is difficult to draw conclusions on their efficacy in de-identifying these data.
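The header-level approaches (emptying PHI values or substituting dummy values) can be sketched as follows. This is a simplified illustration, not the method of any cited study: a plain dictionary stands in for a DICOM header, and the choice of which attributes to blank or replace, as well as the replacement values, are assumptions.

```python
import copy

# Hypothetical choice of header attributes treated as PHI in this sketch.
BLANK = {"PatientBirthDate", "InstitutionName"}
SUBSTITUTE = {"PatientName": "ANONYMOUS", "PatientID": "STUDY-0001"}


def deidentify_header(header):
    """Return a copy of the header with PHI attributes emptied or replaced."""
    cleaned = copy.deepcopy(header)
    for key in cleaned:
        if key in BLANK:
            cleaned[key] = ""  # remove PHI by emptying the value
        elif key in SUBSTITUTE:
            cleaned[key] = SUBSTITUTE[key]  # substitute a dummy value
    return cleaned


# Invented example header; a plain dict stands in for a real DICOM header.
original = {
    "PatientName": "DOE^JANE",
    "PatientID": "MRN-448812",
    "PatientBirthDate": "19700314",
    "InstitutionName": "Example Hospital",
    "Modality": "MR",
}
print(deidentify_header(original))
```

Header edits of this kind do not address PHI burned into the pixel data, which is why the pixel-level threshold approach is described separately above.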
The management of identifying data linked to biological samples is becoming increasingly important as multicenter clinical trials and CER require methods to process samples obtained from diverse locations and stored in biobanks. A method of de-identification of biological samples for genetic research uses a third-party encryption method developed by deCODE genetics and the Data Protection Commission of Iceland (DPC).43 Physicians create a population-based list of the patients with a particular disease for the DPC; this list is reversibly encrypted by the DPC such that each social security number (SS) is converted to an alphabet-derived character string (PN). The laboratory receives the PN list and phenotype classification and compares the list against an encrypted population-based genealogy database, all prepared by the DPC. A list of patients of interest for a genetic study is sent to the DPC, and the list is decoded. Patients are contacted by physicians, and those willing to participate have their blood drawn and placed in tubes with a temporary-coded sample number (SN). The SN is scanned into a computer and the patient’s SS is entered in the presence of the patient, establishing a SS-SN link. The DPC officer encrypts the list of SS-SN to PN-SN, establishing the link of SN with PN. The PN-SN list is sent on a sealed computer diskette along with the blood to the laboratory. At the laboratory, the blood tubes are scanned into a sample storage program and the temporary SN is replaced by relabeling with an in-house SN, which remains with the DNA sample isolated from the blood and is directly linked by the storage program to the PN used to label individuals within the database. Thus, the only connection between samples or data in the laboratory and the patient or data on the clinical side is through the PN and the DPC, the sole keeper of the encryption code.
This third-party encryption method takes about 3 hours for a list of 1000 individuals to be encrypted or decrypted, and the cost of the system is related to the labor of the actual encryption and decryption by a trusted third party or representative.43
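The flow of identifiers in this scheme can be sketched as follows. This is not the deCODE/DPC implementation: a private look-up table of random letter strings is swapped in for the actual encryption, and the class name, method names, string length, and identifier values are all hypothetical.

```python
import secrets
import string


class TrustedThirdParty:
    """Hypothetical stand-in for the party that alone holds the SS-PN key."""

    def __init__(self):
        self._ss_to_pn = {}
        self._pn_to_ss = {}

    def _new_pn(self):
        # Draw random letter strings until an unused one is found.
        while True:
            pn = "".join(secrets.choice(string.ascii_uppercase) for _ in range(12))
            if pn not in self._pn_to_ss:
                return pn

    def encrypt(self, ss):
        """Map an SS to its PN, creating the PN on first use."""
        if ss not in self._ss_to_pn:
            pn = self._new_pn()
            self._ss_to_pn[ss] = pn
            self._pn_to_ss[pn] = ss
        return self._ss_to_pn[ss]

    def decrypt(self, pn):
        """Recover the SS for a PN; only the third party can do this."""
        return self._pn_to_ss[pn]

    def relabel(self, ss_sn_pairs):
        """Convert clinic-side SS-SN links into PN-SN links for the laboratory."""
        return [(self.encrypt(ss), sn) for ss, sn in ss_sn_pairs]


dpc = TrustedThirdParty()
clinic_links = [("0101701234", "SN-001"), ("0202805678", "SN-002")]  # invented
lab_links = dpc.relabel(clinic_links)  # the laboratory sees only PN-SN pairs
```

Because the mapping is retained by the third party, this is reversible de-identification rather than anonymization; discarding the table would make the link unrecoverable.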
Furuta et al46 reported a different approach for deidentifying the blood samples in their biorepository. This repository contains approximately 250,000 samples with an average influx of 90,000 samples per year of which approximately 80% need to be de-identified. The de-identification process involves the transfer of samples from clinical identifier-tagged tubes collected from various clinical test settings to new repository identifier-tagged tubes without clinical patient identifiers; these samples are then transferred to the repository for final storage at −20°C. This process differs from data scrubbing patient identifiers on the physical sample as a tube transfer procedure is used for sample de-identification. However, this is a manual de-identification procedure that is subject to human error. To minimize such risk, the investigators manage a small number of tubes (eg, 12 tubes) in each maneuver and have different personnel verify tube identifiers.
Both of these approaches for managing the identifiers linked with biological samples, and other approaches reported by Hara et al44 and Roden et al,45 do not appear to be anonymization strategies, although the description provided by Furuta et al46 does not specifically indicate whether the samples can be reidentified. Nevertheless, it appears that all of these strategies could be modified so that the samples are anonymized rather than simply de-identified. This is especially important for research conducted in the United States as tissue-related research activities may require IRB approval and data use agreements despite the HIPAA exemption for record de-identification, as these activities are subsumed under the Common Rule and de-identification may not meet the anonymization requirements needed for IRB exemption under this rule.
The review of these strategies uncovers key questions:
Are Current De-identification Strategies Effective?
There is no question that current de-identification strategies have impressive recall and precision rates. However, no existing system is perfect, and there is the possibility that certain PHI will not be de-identified. Moreover, not all instances of PHI are equal, with some identifiers (eg, name) more critical than others. In addition, limitations of many current systems include an inability to detect misspellings, typographical errors, and proper names that share characteristics with non-PHI (eg, the family name “Black”); restrictions in managing only certain types of data (eg, discharge summaries); algorithms that are not designed to handle diverse PHI (eg, hard-coded or embedded PHI in device-generated output files); and difficulty in compensating for regional or geographic variation in nomenclature. The challenge is to balance the levels of de-identification that are acceptable to the patients, research participants, clinicians, researchers, institutions, and federal requirements (all of which may not necessarily be similar) with operational factors related to time, cost, and labor.
Which Strategies are Best?
In the case of de-identification of free text, both heuristic, lexical, and pattern-based systems and statistical learning-based systems have their advantages and disadvantages. For the former, studies evaluating these systems have reported good performance (especially precision), but experienced domain experts must spend a significant amount of time and effort developing, organizing, maintaining, and extending the rules, which likely need to be modified for different datasets. Statistical learning-based systems can be used “out of the box” with minimal redevelopment time and learn how to identify PHI from the data itself rather than relying on precompiled, manually constructed sets of data. These latter systems may often serve a preprocessing role for de-identification, relying on subsequent manual processes to complete the de-identification steps missed by the automated system.23 Annotated training data typically need to be produced specific to the particular type of EHR data to be de-identified, but software tools can enhance the process.50 For both images and biological samples, there are too few studies with a paucity of quantitative data to judge the best approach, and the latter data category has the added Common Rule anonymization requirements needed for IRB exemption that do not appear to be satisfactorily addressed by the current approaches. Finally, these strategies will continue to improve as the HHS is in the process of revising the rules protecting research subjects and the HHS Office for Civil Rights has held recent conferences on de-identification standards and HIPAA privacy.
How Essential is Anonymization?
In theory, anonymization is important as it places the patient’s or research participant’s right to privacy as the top priority in any anticipated or unanticipated scenario, and dramatically minimizes the release of sensitive information that may discriminate or stigmatize the individual from a social or economic perspective. In practice, it still may be possible to identify an individual from supposedly anonymized datasets, especially with respect to rare diseases within a specific geographical area. In addition, the inability to reidentify individuals with their data may hamper the ability of investigators to conduct further studies on given individuals and pose bioethical challenges regarding the inability of clinicians to inform patients of results uncovered in their studies that may be relevant to their future health or well-being. Strategies that use pseudonymization rather than true anonymization may help to resolve these constraints, as they allow the data to be associated with a patient only under specified and controlled circumstances.
Do De-identification Strategies Alone Meet the Needs of Multicenter Research Studies, Including CER? This question has 2 components:
* Besides the de-identification of individual documents, what can be done to ensure the privacy of datasets? Heuristic methods for data de-identification are frequently used to comply with the HIPAA Privacy Rule; these include creation of a de-identified dataset or limited dataset, or generation of variables to replace identifiers (HIPAA de-identification relinking fields must not use hash codes, although limited datasets may) as long as the codes prevent identification of individuals by the data recipient. There are approaches that use 1 or more of these methods; for example, systems have been developed in which the user can request specific fields and records and a database is generated with information matching the anonymity level set by the user with respect to the recipient profile. In addition, it has been argued by Ferris and others that to protect PHI and to construct a useful clinical research database, a hybrid system approach utilizing secure key escrow, de-identification, and role-based access for IRB-approved researchers may be preferable to offer flexible control of PHI while meeting the needs of biomedical researchers.
* What approaches can be used on a multicenter level to ensure patient or participant privacy? De-identification and anonymization strategies comprise 1 important component of an integrated data collection and management system used in multicenter research studies and CER. In addition, some institutions use honest brokers, which collect and provide data to research investigators in a manner whereby it would not be reasonably possible for investigators to identify the participants directly or indirectly. Honest brokers are not part of either the clinical or research team as the honest broker links research identifiers and clinical identifiers. Honest brokers or coordinating centers of multicenter studies may also use universal identifiers or a master patient index to associate various patient identifiers across disparate systems, thus enabling the exchange of data between participating EHR systems and centers.
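The linking role of an honest broker with a master patient index can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any deployed system: the class, its methods, the site names, and the identifiers are hypothetical, and records passed in are assumed to have already been stripped of direct identifiers.

```python
import uuid


class HonestBroker:
    """Hypothetical broker holding the only link between site and study IDs."""

    def __init__(self):
        self._index = {}  # master patient index: (site, local_id) -> study_id

    def link(self, *site_ids):
        """Register site-local identifiers known to belong to one patient."""
        study_id = None
        for key in site_ids:
            if key in self._index:
                study_id = self._index[key]  # reuse an existing study ID
                break
        if study_id is None:
            study_id = uuid.uuid4().hex  # random, carries no patient information
        # Sketch only: merging two patients already indexed apart is not handled.
        for key in site_ids:
            self._index[key] = study_id
        return study_id

    def release(self, site, local_id, record):
        """Return the record keyed by the research identifier only."""
        return {"study_id": self._index[(site, local_id)], **record}


broker = HonestBroker()
study_id = broker.link(("site_a", "MRN-17"), ("site_b", "P-9902"))  # invented IDs
row_a = broker.release("site_a", "MRN-17", {"systolic_bp": 128})
row_b = broker.release("site_b", "P-9902", {"systolic_bp": 131})
```

Investigators receive rows such as row_a and row_b, which share a study identifier and can therefore be combined across centers, while the index that maps back to site identifiers stays with the broker.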
What is an Important Area for Further Work?
One critical area is the management of identifiers for the protection of genetic information, particularly with respect to protecting the identities of the individuals from whom DNA sequences were derived. This area of genomic privacy is particularly challenging for the biomedical community, given the immense quantity of data that needs to be processed, stored, and shared, as well as the consequences that identifying genomic data may have on an individual’s health, employment, and insurance status.
1. Sweeney L. Computational Disclosure Control: A Primer on Data Privacy Protection. 2001 Cambridge, MA Massachusetts Institute of Technology
2. Neamatullah I, Douglass MM, Lehman LW, et al. Automated de-identification of free-text medical records. BMC Med Inform Decis Mak. 2008;8:32–47
3. Sweeney L. Replacing personally-identifying information in medical records, the scrub system. In: Cimino JJ, ed. Beyond the Superhighway: Exploiting the Internet with Medical Informatics. Proc AMIA Annu Fall Symp. 1996 Washington, DC:333–337
4. Ruch P, Baud RH, Rassinoux AM, et al. Medical document anonymization with a semantic lexicon. In: Overhage JM, ed. Converging Information, Technology, and Health Care. Proc AMIA Symp. 2000 Los Angeles, CA:729–733
5. Miller R, Boitnott JK, Moore GW. Web-based free-text query system for surgical pathology reports with automatic case de-identification. Advancing Pathology Informatics, Imaging, and the Internet (APIII 2000). Arch Pathol Lab Med. 2001;125:1011
6. Taira RK, Bui AA, Kangarloo H. Identification of patient name references within medical documents using semantic selectional restrictions. Proc AMIA Symp. 2002:757–761
7. Thomas SM, Mamlin B, Schadow G, et al. A successful technique for removing names in pathology reports using an augmented search and replace method. In: Kohane IS, ed. Biomedical Informatics: One Discipline. Proc AMIA Symp. 2002 San Antonio, TX:777–781
8. Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med. 2003;127:680–686
9. Douglass M, Clifford GD, Reisner A, et al. Computer-assisted de-identification of free text in the MIMIC II database. In: Murray A, ed. Computers in Cardiology. 2004 Chicago, IL American Society for Clinical Pathology:341–344
10. Gupta D, Saul M, Gilbertson J. Evaluation of a de-identification (De-Id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol. 2004;121:176–186
11. Fielstein EM, Brown SH, Speroff TFieschi M. Algorithmic de-identification of VA medical exam text for HIPAA privacy compliance: preliminary findings. MEDINFO 2004. 2004 Amsterdam IOS Press:1590
12. Douglass MM, Cliffford GD, Reisner A, et al. De-identification algorithm for free-text nursing notes. Computers in Cardiology. 2005 Lyon, France Institute of Electrical and Electronics Engineers:331–334
13. Sweeney JP, Portell KS, Houck JA, et al. Patient note de-identification using a find-and-replace iterative process. J Healthc Inf Manag. 2005;19:65–70
14. Aramaki E, Imai T, Miyo K, et al. Automatic de-identification by using sentence features and label consistency. i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data, 2006
15. Beckwith BA, Mahaadevan R, Balis UJ, et al. Development and evaluation of an open source software tool for de-identification of pathology reports. BMC Med Inform Decis Mak. 2006;6:12–21
16. Dorr DA, Phillips WF, Phansalkar S, et al. Assessing the difficulty and time cost of de-identification in clinical narratives. Methods Inf Med. 2006;45:246–252
17. Guo Y, Gaizauskas R, Roberts I, et al. Identifying personal health information using support vector machines. i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data, 2006
18. Hara K Applying a SVM based chunker and a text classifier to the Deid Challenge. i2b2 Workshop on Challenges in Natural Language Processing for Clinical Data, 2006
19. Pestian JP, Itert L, Andersen C, et al. Preparing clinical text for use in biomedical research. J Database Manage. 2006;17:1–11
20. Sibanda TC. Was the patient cured? Understanding semantic categories and their relationships in patient records. Electr Eng Comput Sci. 2006 Boston, MA Massachusetts Institute of Technology:107
21. Sibanda T, He T, Szolovits P, et al. Syntactically-informed semantic category recognition in discharge summaries. AMIA Annu Symp Proc. 2006:714–718
22. Szarvas G, Farkas R, Busa-Fekete R. State-of-the-art anonymization of medical records using an iterative machine learning framework. J Am Med Inform Assoc. 2007;14:574–580
23. Wellner B, Huyck M, Mardis S, et al. Rapidly retargetable approaches to de-identification in medical records. J Am Med Inform Assoc. 2007;14:564–573
24. Friedlin FJ, McDonald CJ. A software tool for removing patient identifying information from clinical documents. J Am Med Inform Assoc. 2008;15:601–610
25. Gardner J, Xiong L HIDE: an integrated system for Health Information De-identification. Computer-Based Medical Systems, 2008 CBMS’08 21st IEEE International Symposium on Computer-Based Medical Systems. Jyvaskyla, Finland; 2008
26. Uzuner O, Sibanda TC, Luo Y, et al. A de-identifier for medical discharge summaries. Artif Intell Med. 2008;42:13–35
27. Grouin C, Rosier A, Dameron O, et al. Testing tactics to localize de-identification. Stud Health Technol Inform. 2009;150:735–739
28. Morrison FP, Li L, Lai AM, et al. Repurposing the clinical record: can an existing natural language processing system de-identify clinical notes? J Am Med Inform Assoc. 2009;16:37–39
29. Morrison FP, Sengupta S, Hripcsak G. Using a pipeline to improve de-identification performance. AMIA Annu Symp Proc. 2009;2009:447–451
30. Velupillai S, Dalianis H, Hassel M, et al. Developing a standard for de-identifying electronic patient records written in Swedish: precision, recall and F-measure in a manual and computerized annotation trial. Int J Med Inform. 2009;78:e19–e26
31. Aberdeen J, Bayer S, Yeniterzi R, et al. The MITRE Identification Scrubber Toolkit: design, training, and assessment. Int J Med Inform. 2010;79:849–859
32. Huang LC, Chu HC, Lien CY, et al. Embedding a hiding function in a portable electronic health record for privacy preservation. J Med Syst. 2010;34:313–320
33. Dalianis H, Velupillai S. De-identifying Swedish clinical text—refinement of a gold standard and experiments with conditional random fields. J Biomed Semantics. 2010;1:6–15
34. Tu K, Klein-Geltink J, Mitiku TF, et al. De-identification of primary care electronic medical records free-text data in Ontario, Canada. BMC Med Inform Decis Mak. 2010;10:35–41
35. Yeniterzi R, Aberdeen J, Bayer S, et al. Effects of personal identifier resynthesis on clinical text de-identification. J Am Med Inform Assoc. 2010;17:159–168
36. Bischoff-Grethe A, Ozyurt IB, Busa E, et al. A technique for the de-identification of structural brain MR images. Hum Brain Mapp. 2007;28:892–903
37. Bland PH, Laderach GE, Meyer CR. A web-based interface for communication of data between the clinical and research environments without revealing identifying information. Acad Radiol. 2007;14:757–764
38. Onken M, Riesmeier J, Engel M, et al. Reversible anonymization of DICOM images using automatically generated policies. Stud Health Technol Inform. 2009;150:861–865
39. Clark KW, Gierada DS, Marquez G, et al. Collecting 48,000 CT exams for the lung screening study of the National Lung Screening Trial. J Digit Imaging. 2009;22:667–680
40. Gonzalez DR, Carpenter T, van Hemert JI, et al. An open source toolkit for medical imaging de-identification. Eur Radiol. 2010;20:1896–1904
41. Zhu Y, Singh PD, Siddiqui K, et al.Liu BJ, Boonn WW An automatic system to detect and extract text in medical images for de-identification. Medical Imaging 2010: Advanced PACS-based Imaging Informatics and Therapeutic Applications. 2010 San Diego, CA International Society for Optics and Photonics
42. Engelstad ME, McClellan M, Jacko JA, et al. De-identification of facial images using omposites. J Oral Maxillofac Surg. 2011;69:3026–3031
43. Gulcher JR, Kristjansson K, Gudbjartsson H, et al. Protection of privacy by third-party encryption in genetic research in Iceland. Eur J Hum Genet. 2000;8:739–742
44. Hara K, Ohe K, Kadowaki T, et al. Establishment of a method of anonymization of DNA samples in genetic research. J Hum Genet. 2003;48:327–330
45. Roden DM, Pulley JM, Basford MA, et al. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin Pharmacol Ther. 2008;84:362–369
46. Furuta K, Yokozawa K, Takada T, et al. De-identification procedure and sample quality of the post-clinical test samples at the bio-repository of the National Cancer Center Hospital (NCCH) in Tokyo. Jpn J Clin Oncol. 2011;41:295–298
47. Sleator D, Temperley D Parsing English With a Link Grammar. Technical Report CMU-CS-91-196. 1991 Pittsburgh, PA Computer Science Department, Carnegie Mellon University
48. Uzuner O, Luo Y, Szolovits P. Evaluating the state-of-the-art in automatic de-identification. J Am Med Inform Assoc. 2007;14:550–563
49. Prior FW, Brunsden B, Hildebolt C, et al. Facial recognition from volume-rendered magnetic resonance imaging data. IEEE Trans Inf Technol Biomed. 2009;13:5–9
50. Mayer J, Shen S, South BR, et al. Inductive creation of an annotation schema and a reference standard for de-identification of VA electronic clinical notes. AMIA Annu Symp Proc. 2009;2009:416–420
Key Words: de-identification; anonymization; electronic health record
© 2012 Lippincott Williams & Wilkins, Inc.