Publicly available datasets are an attractive resource for epidemiologists. But, as it turns out, there is a price to be paid for this “free” information—a price that investigators may find uncomfortably high.
Epidemiology recently received a submission based on NHANES data. (NHANES comprises a vast collection of health surveys, clinical examinations, and laboratory data collected on samples of U.S. households. Data on tens of thousands of people are as close as a click of your Web browser.) 1
When we sent this paper for review, we received some unexpected news: one reviewer informed us that he was coauthor of a similar paper published just 2 months before, using the same dataset and drawing the same conclusions. We had little choice but to reject the manuscript, on the grounds that its contribution over the published paper was negligible.
This situation is not to be confused with duplicate publication by a single group of authors. The problem here is not malfeasance, but wasted effort. Analyzing data and writing manuscripts require an epidemiologist's best skills. The effort and time required are our scarcest resources. In retrospect, had these authors known that others were already working on the problem, they would probably have chosen to spend their efforts elsewhere.
This problem is not new—although it may be unfamiliar to many epidemiologists. Scientists in other areas of research are often in direct competition with their colleagues. The first to identify a new mathematical proof or a new biochemical pathway wins the prize. In contrast, epidemiology thrives on replication. It is the accumulation of consistent observations that gives us confidence in our findings—but this assumes different groups of epidemiologists are working on independent sources of data. With the availability of publicly accessible data, epidemiologists are thrown into a new game. By all signs, access to epidemiologic data is likely to expand, 2 and cases of unexpected duplication of effort are likely only to increase.
What is the solution? And who should be responsible for it? In the case of NHANES, the sponsoring agency might consider providing space on the NHANES Web site where investigators could give public notice of their ongoing projects. There are no doubt other complexities that must be taken into account. And, of course, there are times when independent replication of results within a given dataset may be a good thing. But most researchers would probably want to know they are replicating other work, rather than discover it after the fact. Investigators should be able to make informed decisions about how they devote their research time.
We recognize that the issues of public datasets go far beyond this particular problem. There are perplexing and far-reaching questions being raised by the movement to encourage data sharing. Our purpose here is simply to point out what seems to be a new problem, and one that is likely to worsen.
Research datasets represent enormous public investment. The same can be said for the training and employment of skilled epidemiologists. When an epidemiologist's work is wasted, it is a loss. It would be an even greater loss if the exploration of publicly accessible datasets were to become too much of a risk to take.
1. Centers for Disease Control and Prevention. NIH Announces Draft Statement on Sharing Research Data. Available at: http://www.cdc.gov/nchs/express.htm. Accessed 17 May 2002.