Anonymizing Personal Identifiers in Genetic Epidemiologic Studies : Epidemiology

Anonymizing Personal Identifiers in Genetic Epidemiologic Studies

Wjst, Matthias

Epidemiology 16(1):p 131, January 2005. | DOI: 10.1097/01.ede.0000147167.61502.8a

To the Editor:

Data in genetic epidemiologic studies pose major challenges with regard to the autonomy and privacy of study participants.1 A high level of data protection is important for public acceptance of such studies, while too many restrictions can make genetic data useless for research.2

Individual records are usually anonymized with a unique key that can be used to link addresses if participants need to be recontacted or to add follow-up data.3 As this key is the critical factor for anonymity, separate clinical and research locations have been suggested, with keys for record linkage stored with a trusted third party. Although this procedure is effective in protecting privacy, it is not very practical if immediate data access is needed.4 Furthermore, in most European countries (in contrast with the United States) there is no waiver system that prohibits data access in criminal proceedings. A release of the keyfile would therefore compromise the identities of all study participants.

I developed an alternative procedure while working with a clinical partner who sent me his DNA samples together with clinical data. As I did not want to either delete all names or store them unencrypted on my computer, I replaced all names with their corresponding hash codes.

A hash function H is a transformation that takes an input string x and returns a fixed-size string, which is called the hash value h; that is, h = H(x). Hash functions are often used in cryptography. Input can be of any length, while output has a fixed length. H(x) is relatively easy to compute for any given x; it is one-way and usually collision-free. A hash function H is said to be one-way if it is hard to invert, which means that given a hash value h, it is computationally infeasible to find some input x such that H(x) = h. H is said to be collision-free if it is computationally infeasible to find a string y not equal to x such that H(x) = H(y). The hash value h concisely represents the longer message or document from which it was computed. One can think of it as an improved “checksum” or “digital fingerprint.”

A well-documented example of a hash function is the MD5 algorithm that was developed by Ronald L. Rivest of MIT in 1991 [∼mabzug1/cs/md5/md5.html]. It is an update of the earlier MD4 algorithm with improved security. As input it takes a message of arbitrary length and produces as output a 128-bit value, usually written as a string of 32 hexadecimal digits. As a matter of caution, resulting hash codes should replace original columns only under a unique key constraint that does not allow the same entry twice.
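The fixed-length and deterministic behavior described above can be illustrated with a minimal Python sketch. It assumes the standard-library `hashlib` module; the helper name `md5_hex` is hypothetical and not part of the original procedure. MD5 is used only because the letter names it. It is no longer considered collision-resistant, and `hashlib.sha256` could be substituted in the same place.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 digest of `text` as a string of hexadecimal digits."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Inputs of very different length give outputs of identical length.
short_hash = md5_hex("a")
long_hash = md5_hex("a" * 10_000)
print(len(short_hash), len(long_hash))

# The same input always gives the same hash value.
print(md5_hex("a") == short_hash)

# A small change in the input gives an unrelated-looking hash value.
print(md5_hex("a") != md5_hex("b"))
```

The 128-bit digest appears here as 32 hexadecimal characters, since each character encodes 4 bits.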

The personal identifier I prefer is a combination of “first name-last name-date of birth” of an individual participant where the order, capitalization, connecting characters and date format may vary among studies. Epidemiologic datasets with these hash codes can have many genetic and clinical variables without being compromised by a keyfile.
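As a rough sketch of how such an identifier could replace the name columns, the Python example below builds one possible “first name-last name-date of birth” string and hashes it. The convention chosen (lowercase, surrounding whitespace removed, hyphen as connecting character, ISO date format), the field names `first`, `last`, `born` and `id_hash`, and the function names are all assumptions made for illustration; a real study would fix its own convention. The duplicate check stands in for the unique key constraint mentioned earlier.

```python
import hashlib
from datetime import date


def make_identifier(first: str, last: str, born: date) -> str:
    # Assumed study convention: lowercase, trimmed, joined by "-", ISO date.
    parts = [first.strip().lower(), last.strip().lower(), born.isoformat()]
    return "-".join(parts)


def hash_identifier(first: str, last: str, born: date) -> str:
    """Hash the personal identifier; MD5 is used as in the letter."""
    identifier = make_identifier(first, last, born)
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()


def anonymize(records: list[dict]) -> list[dict]:
    """Replace name and birth date with a hash code; reject duplicate codes."""
    seen = set()
    result = []
    for rec in records:
        code = hash_identifier(rec["first"], rec["last"], rec["born"])
        if code in seen:
            # Mimics a unique key constraint on the hash column.
            raise ValueError("duplicate hash code in record set")
        seen.add(code)
        rest = {k: v for k, v in rec.items() if k not in ("first", "last", "born")}
        result.append({"id_hash": code, **rest})
    return result
```

Because the identifier is normalized before hashing, spelling variants such as " ANNA " and "Anna" map to the same hash code, which matters for the later record linkage.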

For later record linkage, a hash code may be recalculated from nonanonymized address records and queried against all hash codes in the already anonymized record set. If an identical hash code is found, both records belong to the same individual. In other words the key to unlock privacy carries its own legitimization. The proposed procedure seems to work well for case-control studies but not as well for family-based studies, in which the necessary family identifier always compromises the datasets.5
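The linkage step described above might look like the following Python sketch. It assumes that each address record already carries its identifier string in the study's agreed format and that the anonymized records store the hash in a field called `id_hash`; both field names and the function names are hypothetical.

```python
import hashlib


def hash_code(identifier: str) -> str:
    """Recalculate the hash code from a nonanonymized identifier string."""
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()


def link(address_records: list[dict], anonymized: list[dict]) -> list[tuple[dict, dict]]:
    """Pair each address record with the anonymized record sharing its hash."""
    by_hash = {rec["id_hash"]: rec for rec in anonymized}
    matches = []
    for addr in address_records:
        code = hash_code(addr["identifier"])
        if code in by_hash:
            # Identical hash codes: both records belong to the same individual.
            matches.append((addr, by_hash[code]))
    return matches
```

Only someone who already holds a participant's name and date of birth can recompute the hash code, so no separate keyfile has to be stored for the lookup.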

Matthias Wjst

Gruppe Molekulare Epidemiologie; Institut für Epidemiologie; Neuherberg / Munich, Germany


1. Marshall PA, Rotimi C. Ethical challenges in community-based research. Am J Med Sci. 2001;322:259–263.
2. Burgess MM. Beyond consent: ethical and social issues in genetic testing. Nat Rev Genet. 2001;2:147–151.
3. Gill L, Goldacre M, Simmonds H, Bettley G, Griffith M. Computerized linking of medical records: methodological guidelines. J Epidemiol Community Health. 1993;47:316–319.
4. Hunter AGW, Sharpe N, Mullen M, Meschino WS. Ethical, legal, and practical concerns about recontacting patients to inform them of new information: The case in medical genetics. Am J Med Genet. 2001;103:265–276.
5. Laberge CM, Knoppers BM. Rationale for an integrated approach to genetic epidemiology. Bioethics. 1992;6:317–330.
© 2005 Lippincott Williams & Wilkins, Inc.