This question (to share or not to share data) was the topic of the “The Changing Face of Epidemiology” symposium at the 2008 Annual Meeting of the Society for Epidemiologic Research. The 2 presenters, Daniel Levy and Graham Colditz, offered their perspectives based on years of experience as principal investigators for the Framingham Study and the Nurses’ Health Study, respectively. Both studies are unique for the richness of the data collected, for the duration of follow-up, and for the myriad hypotheses still to be tested using their data; both are supported by the National Institutes of Health (NIH) and are costly. The approaches taken for data sharing, as described by Levy and Colditz, differed. For the Framingham Study, there has long been data sharing via public use data sets, whereas the Nurses’ Health Study has required the submission of a proposal for review by the Principal Investigator and subsequently by the study's Advisory Committee. Levy was firm in calling for open access to data collected with public funds, as long as confidentiality and privacy of data are maintained. Colditz spoke to the realities of data sharing: the need for infrastructure, the varied approaches to data sharing issues by Institutional Review Boards, and the competition between the interests of study investigators and their students and the interests of those seeking data access.
In this commentary, I broaden the discussion to issues that inevitably come as we enter the era of data sharing: (1) the definition of the data that are to be shared; (2) the need to develop a culture of data sharing in epidemiology; (3) the consequences of data sharing for epidemiology (as fewer new population studies are supported); and (4) the implications of data sharing for students and researchers.
First, regarding the rhetorical question posed in the title of this commentary—yes, share. Regardless of whether epidemiologists want to share data, the era of data sharing has arrived. One historical landmark was the “Shelby Amendment,” a 1-sentence amendment inserted into the 1998 appropriations bill. The Shelby Amendment led to changes in the Office of Management and Budget's Circular A-110 governing research such that data collected with federal funds and cited in a regulation have to be made available if requested under the Freedom of Information Act. The Shelby Amendment was prompted by the reliance of the Environmental Protection Agency on data from the Harvard Six Cities Study and the American Cancer Society's Cancer Prevention Study II in setting the 1997 National Ambient Air Quality Standard for particulate matter. Industrial stakeholders argued that the data should be available for analysis by affected parties because of the sweeping consequences of the proposed standard. The impetus for data sharing that originated with the Shelby Amendment has been reinforced by the culture of rapid data sharing that came with genomic research.1
As a starting point for considering data sharing in epidemiology, the nature of the data to be shared needs specification. “Data” has been broadly defined as “a collection of items of information,”2 whereas Wikipedia gives the following description: “Data (singular: datum) are collected of natural phenomena descriptors including the results of experience, observation or experiment, or a set of premises. This may consist of numbers, words, or images, particularly as measurements or observations of a set of variables.”3 Should sharing then extend to all data elements that might be collected or generated during the course of an epidemiologic study? Should sharing stop at analytical data sets? Some difficult scenarios of requests can be readily offered: raw electrocardiogram data rather than coded results; dietary consumption data rather than nutrient indices; and raw microarray data before processing to generate analytic variables. Requests could potentially extend to generated variables and related code and perhaps to simulated data sets. This listing is not meant to cause apprehension but to anticipate the kinds of requests that might be made for study “data.”
Surprisingly little has been written in the epidemiologic literature directly on data sharing. A 1991 commentary by Hogue,4 written on behalf of an ad hoc committee of the Society for Epidemiologic Research, addressed ethical and other issues in data sharing. Read now, the commentary still speaks to currently active issues. It calls for data sharing but covers most of the same issues that are still perceived as barriers to doing so. The American College of Epidemiology published a policy statement on data sharing in 2002 (http://www.acepidemiology.org/policystmts/DataSharing.pdf). The statement covered the benefits and challenges of data sharing and offered principles that should underlie data sharing. The current views of academic epidemiologists have been widely influenced by the NIH, which, since 2003, has required a data-sharing plan for all grants with funding exceeding $500,000 in any year. Even so, we know little about how investigators have responded to this requirement or about the consequences of the new policy, both immediate and longer term. More recently, for genome-wide association studies, the NIH is giving investigators 12 months of exclusivity before sharing data.5
Among researchers, there is now growing understanding of the implications of data sharing and of the potential effort involved in doing so. For several decades, I have received requests for data sets that my colleagues and I had developed. In the past, requests came rarely and typically involved a letter, sometimes apologetic in tone, followed by a phone call and even in-person meetings. The mechanisms for sharing data were primitive: generating a data set that might be sent via a tape to be read on a mainframe computer. Now, the requests are frequent and come via email; they arrive from all over the world and may be made by academic or nonacademic researchers, and frequently by graduate students in search of data for a thesis.
In several major studies, my collaborators have built interactive, web-based systems to address these requests. About 10 years ago, our team at Johns Hopkins (including Scott Zeger and Francesca Dominici), along with collaborators at Harvard (Joel Schwartz and Antonella Zanobetti), began a comprehensive set of time-series analyses of air pollution and morbidity and mortality, the National Morbidity, Mortality and Air Pollution Study (NMMAPS).6,7 For the mortality analyses, we used publicly available data sets on mortality, air pollution, temperature, and county characteristics. Numerous requests for data were received and consequently, with support and encouragement from the Health Effects Institute (the funder of NMMAPS), a website (iHAPSS or Internet-based Health & Air Pollution Surveillance System) was developed that made available the data as well as the statistical programs in R that had been used for analyses (http://www.ihapss.jhsph.edu/). Presently, there are about 925 hits per week to the website from 245 unique IP addresses. I was able to identify 23 publications based on the resources of iHAPSS: 18 using the data and 5 using only the software. Subsequently, Zeger and Dominici teamed with Roger Peng and advanced the concept of “reproducible research,” based on the NMMAPS experience.8
Similarly, the Reading Center for the Sleep Heart Health Study, a multicenter study of sleep-disordered breathing and cardiovascular disease risk, developed a website for the distribution of raw data from polysomnograms, the overnight records of sleep and breathing (http://dceweb1.meds.case.edu/). This website was a response to numerous requests for these data; it lets visitors specify the characteristics of the participants whose data are of interest. A formal process involving completion of a data distribution agreement is required. As with iHAPSS, the funding agency (NIH in this case) provided financial support specifically for website development.
These websites, NIH policies, and the statements of professional organizations are indicators of a move toward a culture of data sharing in epidemiology. In other fields, there are established traditions of data sharing.9 In fact, Hogue commented in 1991 that there were already advances in other fields. Views on data sharing may diverge among epidemiologic researchers, as documented in part by the differing principles and approaches adopted by Framingham and Nurses’ Health Study investigators. Nonetheless, there is an immediate imperative for the formulation of principles and approaches for sharing epidemiologic data. Current drivers include the Shelby Amendment and NIH policies, as well as new models for the conduct of epidemiologic research that give increasing emphasis to large, centrally driven initiatives.
The move toward “big epidemiology” has been previously addressed by Epidemiology.10–12 For many current research questions, particularly those related to disease genetics, large populations are needed. Cohorts of 500,000 and more participants have been developed or proposed.13 The data sets resulting from such large investigations will be remarkable for their size and cost. The National Children's Study, a prospective US cohort study of 100,000 children enrolled during pregnancy or even before conception, is now in progress.14 Its existence will undoubtedly limit opportunities for individual investigators to start their own birth cohorts.
At the same time, NIH funding for new population studies is severely and increasingly restricted. Sharing of data from these new large studies is inescapable: the data are being collected with public funds, the data sets will be massive and afford myriad opportunities for developing and testing hypotheses, and the existence of these studies restricts alternative opportunities for epidemiologists generally. Some models are available for planning data sharing for these large studies, but some aspects are as yet unaddressed. How will analyses based on shared data be funded? How will priorities of the original investigators and data analysts be balanced? How will the quality of analyses arising from data sharing be assured?
For academic departments, there are further issues, some related to faculty and others to students. For faculty, how much of a commitment should be made to participate in these large, typically multicenter initiatives? Lacking involvement, would a department's faculty have less ready access to data? How will participation in such initiatives be considered in regard to academic advancement and contributions? For students, should all students necessarily be trained in data collection? Is there an emerging role for epidemiologists who will be skilled analysts of data collected in large studies? Perhaps we need to develop new roles for “primary investigators” and “secondary investigators.” Hogue's 1991 paper refers to “primary researchers” and “secondary analysts.” This dichotomy should be anticipated and may already be at hand. The Nurses’ Health Study has been the basis for more than 70 dissertations. The students have benefited from the extensive data available, which led to meaningful research, but most were not involved in any way in the design and conduct of the study. Does this experience prepare them to be primary researchers or secondary analysts?
Epidemiologists, academic entities, professional organizations, and funders need to become engaged proactively in data sharing. The topic is under wide discussion already (Arzberger et al,15 2004; Borgman,9 2007). The community of epidemiologists needs to develop its own culture of data sharing, to address the sweeping implications of data sharing, and to engage with researchers in other fields on this issue. Formal study is needed of data sharing practices, their evolution, and their consequences for our field.
1. Foster MW, Sharp RR. Share and share alike: deciding how to distribute the scientific and social benefits of genomic data. Nat Rev Genet
2. Porta M. A Dictionary of Epidemiology. 5th ed. New York: Oxford University Press; 2008.
4. Hogue CJ. Ethical issues in sharing epidemiologic data. J Clin Epidemiol. 1991;44(suppl 1):103S–107S.
5. National Institutes of Health (NIH). Policy for sharing of data obtained in NIH supported or conducted genome-wide association studies (GWAS). NOT-OD-07-088.
6. Samet JM, Zeger S, Dominici F, et al. The National Morbidity, Mortality, and Air Pollution Study (NMMAPS). Part II. Morbidity and mortality from air pollution in the United States. Cambridge, MA: Health Effects Institute; 2000.
7. Samet JM, Zeger S, Dominici F, et al. The National Morbidity, Mortality, and Air Pollution Study (NMMAPS). Part I. Methods and methodological issues. Cambridge, MA: Health Effects Institute; 2000.
8. Peng RD, Dominici F, Zeger SL. Reproducible epidemiologic research. Am J Epidemiol
9. Borgman CL. Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, MA: MIT Press; 2007.
10. Ness RB. “Big” science and the little guy. Epidemiology
11. Kaplan GA. How big is big enough for epidemiology? Epidemiology
12. Hoover RN. The evolution of epidemiologic research: from cottage industry to “big” science. Epidemiology
13. Davis RL, Khoury MJ. The emergence of biobanks: practical design considerations for large population-based studies of gene-environment interactions. Community Genet
15. Arzberger P, Schroeder P, Beaulieu A, et al. Science and government. An international framework to promote access to data. Science