The merits of data sharing and transparency in epidemiology have been debated for several years.1–4 In 2015, the Center for Open Science produced a set of guidelines aimed at scientific transparency,5 but as of this writing, no major epidemiology journal has become a signatory.6 One important criticism of these guidelines is their rigid, one-size-fits-all nature, focusing on reproducibility rather than advancement.3 Yet we cannot have advancement without transparency. Further, the data that epidemiologists use are often real patient data, with the exception of simulation studies, and are less amenable to sharing unless they fulfill public use data set requirements. Calls for more transparency in science often focus on these data sets;7 for example, the AIDS Clinical Trials Group maintains publicly accessible data files.8 If the work is funded by the National Institutes of Health, data sharing may be required as part of the grant.9
Aside from the data themselves, the analytic computer codes are just as useful, perhaps more so, in that they allow incremental additions to methodologies in addition to reproducibility. Less frequently mentioned, they are also useful as educational resources, especially for appreciating how complex methods are implemented. This focus on advancement through peer collaboration is at the heart of the open-source software movement. Recognizing the benefits of disclosure of the analytic codes, voluntary initiatives are occurring within the broader scientific community,10 as well as within epidemiology,1 but little guidance exists for epidemiologists who wish to share their code. For example, EPIDEMIOLOGY requests upon article submission a “description of the process by which someone else could obtain […] computing code,” but specific guidance is not offered.11 Acknowledging that more epidemiologists will opt in to releasing their work over time, this commentary discusses mechanisms to release code in a voluntary manner that protects the authors and has potential to advance open-source epidemiology.
In the case of an analysis from a data set containing patient data, the only content that the researcher can reasonably be expected to share is the analytic code or its deidentified by-products. The data, whose ownership by the institution or the patient is a contested area in work derived from medical records,12 would need to be thoroughly vetted by an institutional review board before being released to the public. If the data are simulated and represent an artificial patient population, both the source code and data sets can potentially be shared when there is no risk of patient information being released. Fitted statistical models and other intermediate data products resulting from analytic code execution are also candidates for release, provided they do not contain the source data. These analytic by-products aid in transparency, reproducibility, and education.
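As a minimal sketch of the simulated-data case, the Python example below generates an artificial patient population from a fixed random seed and serializes it as CSV text, so that both the generating code and the resulting data set could be shared. The variable names, the exposure prevalence, and the outcome risks are illustrative assumptions and are not drawn from any real study.

```python
import csv
import io
import random


def simulate_cohort(n_patients, seed=2017):
    """Return a list of artificial patient records (no real patient data)."""
    rng = random.Random(seed)  # fixed seed so others can regenerate identical data
    cohort = []
    for patient_id in range(1, n_patients + 1):
        exposed = rng.random() < 0.30      # assumed 30% exposure prevalence
        risk = 0.20 if exposed else 0.10   # assumed outcome risks by exposure
        outcome = rng.random() < risk
        cohort.append({
            "patient_id": patient_id,
            "age_years": rng.randint(18, 90),
            "exposed": int(exposed),
            "outcome": int(outcome),
        })
    return cohort


def cohort_to_csv(cohort):
    """Serialize the simulated cohort as CSV text suitable for sharing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(cohort[0].keys()))
    writer.writeheader()
    writer.writerows(cohort)
    return buffer.getvalue()


if __name__ == "__main__":
    simulated = simulate_cohort(n_patients=1000)
    print(cohort_to_csv(simulated)[:200])
```

Because the seed is fixed, anyone running the shared code should regenerate the same artificial cohort, which allows the data file and the code to be checked against each other.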
Regardless of whether the data are shared, in some cases portions of the analytic code (even variable names) may be proprietary or the property of a third party. Before the computing code is published, it is up to the researcher to ensure that no protected health information, and no proprietary or copyrighted information that could cause harm to an individual or business, is released. This may require some form of redaction prior to sharing.
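One possible form of redaction is sketched below in Python, assuming the researcher has already compiled a list of protected variable names. The names in the mapping are hypothetical. The sketch only substitutes known identifiers; it does not detect protected health information, so manual review before release would still be needed.

```python
import re

# Hypothetical mapping from proprietary variable names to neutral placeholders.
REDACTIONS = {
    "acme_risk_score": "var_001",
    "acme_site_code": "var_002",
}


def redact_identifiers(source_code, redactions):
    """Replace each whole-word occurrence of a protected name with its placeholder."""
    redacted = source_code
    for protected_name, placeholder in redactions.items():
        pattern = r"\b" + re.escape(protected_name) + r"\b"
        redacted = re.sub(pattern, placeholder, redacted)
    return redacted


def find_remaining(source_code, redactions):
    """List protected names still present, as a final check before release."""
    return [name for name in redactions
            if re.search(r"\b" + re.escape(name) + r"\b", source_code)]


if __name__ == "__main__":
    original = "model = fit(outcome ~ acme_risk_score + acme_site_code)"
    cleaned = redact_identifiers(original, REDACTIONS)
    print(cleaned)
    print(find_remaining(cleaned, REDACTIONS))
```

Keeping the mapping private while sharing only the redacted code lets the researcher restore the original names internally if needed.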
PROTECTING INTELLECTUAL PROPERTY
When writing computing code for analysis, the work may be copyrightable, in that, as the author of the code, we have the right to “reproduce, adapt, distribute, perform, and display the work”.13 We may then choose to license this work to other individuals for their use, using an open-source licensing scheme. The exact type of protection required depends on several factors, including the use of third-party code, funder stipulations, and institutional responsibilities.
Analyses conducted using commercially available statistical software, such as SAS (SAS Institute, Cary, NC), Stata (StataCorp, College Station, TX), or SPSS (IBM, Armonk, NY), that do not include third-party utilities may only need licensing if the researcher wishes to protect intellectual property, particularly the attribution, use, and misuse of the code. Analyses conducted in an existing open-source platform, such as R (R Foundation for Statistical Computing, Vienna, Austria), or that include third-party code may require licensing before public dissemination to protect existing intellectual property and enforce existing open-source licenses.
There are a variety of licenses available along with comprehensive resources for comparing the schemes.14–17 For most purposes, the Berkeley Software Distribution or Massachusetts Institute of Technology license is a simple and concise license that has few restrictions on the use of the analytic code, other than including the license and copyright notice in the code. The GNU General Public License (GPL) provides a mechanism for inherited licenses, for example, when the analytic code employs an existing software library or package that was itself released under GPL.
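To illustrate what including the license and copyright notice in the code might look like in practice, the Python sketch below prepends a short notice to an analysis script. It assumes the full license text is kept in a separate LICENSE file alongside the code, and the copyright holder and year are placeholders. The SPDX identifier line is a common convention for naming a license in a machine-readable way; it is not a requirement of the license itself.

```python
LICENSE_NOTICE = """\
# Copyright (c) {year} {holder}
# SPDX-License-Identifier: MIT
# Released under the MIT License; see the LICENSE file for the full text.
"""


def add_license_header(source_code, holder, year):
    """Prepend a copyright and license notice unless one is already present."""
    if "SPDX-License-Identifier" in source_code:
        return source_code  # avoid stacking duplicate notices
    return LICENSE_NOTICE.format(year=year, holder=holder) + "\n" + source_code


if __name__ == "__main__":
    script = "print('analysis')\n"
    print(add_license_header(script, holder="Example Author", year=2017))
```

For code that builds on a GPL-licensed library, the notice and identifier would need to reflect the inherited license rather than the permissive one shown here.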
If the work was unfunded scholarly research, it is straightforward to obtain one of these open-source licenses and release the analytic code to the public. If the work was done for hire or as sponsored research, there may be stipulations on what can and cannot be done with the code, as well as the use of third-party utilities. In general, work funded through a government agency, such as the National Institutes of Health or the National Science Foundation, will be more amenable to open-source principles. Nongovernmental sponsored research may have more stringent data sharing requirements spelled out in the service agreement, while work done for hire may be the most restrictive in terms of data sharing, and the epidemiologist may need to obtain written permission prior to releasing the analytic codes. Epidemiologic work done by government employees falls under the public domain, and therefore, the analytic code can easily be disseminated as open source. Ultimately, it is incumbent upon the researchers to engage their respective institutions to discuss approaches to sharing analytic code.
HOSTING ANALYTIC CODE
There is a bevy of existing mechanisms for publishing computing code. The researchers’ institutions may wish to host the content on personal or departmental pages, the journal may accept code as supplemental online content, or researchers may have their own website where the content can be hosted. If the analytic code is stand-alone software, for example, a package that can be loaded into the R platform, the software’s vendor or distributor may have a hosting option available for potential developers, in this case the Comprehensive R Archive Network.18 One of the more successful open-source hosting platforms, GitHub (GitHub, Inc, San Francisco, CA), has a version control platform, which tracks code changes, in addition to access control, automated licensing options (including Massachusetts Institute of Technology and GPL), and collaborative solutions. There are also joint initiatives specific to sharing code and data underlying research publications, such as www.runmycode.org or www.zenodo.org.
There are several key benefits to hosting one’s code with a dedicated hosting platform. First and foremost are the maintenance aspects. The researcher can make incremental changes to the code as methodology advances or bugs are discovered, and can deprecate code when it is superseded by superior methods. Second, given the collaborative nature of epidemiology, these platforms allow multiple users to work off a single code base. Third, journals may enforce strict formatting requirements that are not amenable to computer code.
When code is well defined, one of the aforementioned solutions works well because the user can take advantage of the many tools offered by the hosting platforms. But in many cases, epidemiologic code may be short snippets of statistical tests, especially if there were few data management steps necessary. Even in these less structured or ad hoc analyses, codes can still be shared via one of the mechanisms mentioned and will likely be of greatest benefit from an educational perspective. If the underlying data are public or generated from a simulation study, the data themselves are worth sharing; the described solutions can successfully host data files in addition to the source code.
Irrespective of what is shared or where it resides, the analytic code must be usable to other researchers. This includes adhering to “best practices” such as meaningful comments and external documentation as warranted; clear and unambiguous variable and file names; and logically simple constructs. For a more thorough yet approachable treatment of these practices, the reader is referred elsewhere.19 Ideally, the epidemiologist would also be available to provide as-needed assistance. The time invested for maintenance and support may not be trivial; yet, to advance open-source epidemiology, it is not enough to simply make code accessible; it must also be usable.
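As one illustration of these practices, the following minimal Python sketch computes an odds ratio with a Wald confidence interval from a 2 × 2 table, using descriptive argument names, a documented assumption, and a simple linear structure. The function name, argument names, and example counts are hypothetical and chosen only for readability.

```python
import math


def odds_ratio_with_ci(exposed_cases, exposed_noncases,
                       unexposed_cases, unexposed_noncases,
                       z_critical=1.96):
    """Return (odds ratio, lower limit, upper limit) for a 2 x 2 table.

    The interval is the Wald interval on the log odds scale, which
    assumes that all four cell counts are greater than zero.
    """
    cells = (exposed_cases, exposed_noncases,
             unexposed_cases, unexposed_noncases)
    if any(count <= 0 for count in cells):
        raise ValueError("All four cells must be positive for the Wald interval.")

    # Cross-product ratio of the table.
    odds_ratio = (exposed_cases * unexposed_noncases) / (
        exposed_noncases * unexposed_cases)

    # Standard error of the log odds ratio: square root of summed reciprocals.
    se_log_or = math.sqrt(sum(1 / count for count in cells))

    log_or = math.log(odds_ratio)
    lower_limit = math.exp(log_or - z_critical * se_log_or)
    upper_limit = math.exp(log_or + z_critical * se_log_or)
    return odds_ratio, lower_limit, upper_limit


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    estimate, lower, upper = odds_ratio_with_ci(20, 80, 10, 90)
    print(f"OR = {estimate:.2f} (95% CI: {lower:.2f}, {upper:.2f})")
```

Stating the zero-cell assumption in the documentation and enforcing it in the code means a later user who supplies a sparse table receives an explicit error rather than a silent division failure.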
DISCLOSURE OF THE ANALYTIC CODE
There are two possible options for disclosure of the analytic code, depending on whether the research has been published or presented. If it has not been published, authors can readily include a reference to the code either as supplemental materials or in the article itself. This can be as easy as including a footnote or line at the end of the Methods section stating, “Source code for analyses available at <https://github.com/goldsteinepi>.” If, on the other hand, the research has been previously published or presented, the code can still be shared retroactively, but it will be more difficult for the audience to locate this resource. A researcher with a web or social media presence can include there a permanent link to his or her repository of analytic code, for example, if in academia, on the faculty member’s web page. As an alternative to using a typical web address (i.e., uniform resource locator or URL) for disclosing code, researchers may prefer using a digital object identifier (or DOI), commonplace in academic referencing. Whereas a URL may change over time, the DOI is a persistent identifier. Obtaining a DOI and making the analytic code citable is possible depending on the chosen hosting platform.20
A VOLUNTARY APPROACH
For this initiative to gain additional traction, it should use an opt-in approach to sharing one’s work rather than requiring it upon submission. There are several reasons why an individual, or their employer, may not wish to share the analytic code. First, an institution may recognize a value to the code itself, for example, a novelty that warrants patent protection. The GNU GPL provides a mechanism for patent protection, and once a patent is obtained, the software can be licensed for public use, if desired. A second reason for not sharing code may be concerns over loss of control. While it is easy to obtain a license to protect the integrity of the analytic code, enforcing the license may prove difficult. An individual may intentionally or inadvertently modify the code to produce spurious results, potentially invalidating future work. However, if the modified code is also published as open source, perhaps this error can be spotted and corrected. A third reason may be related to the investigator’s willingness. The many hours that go into an analysis, perhaps combined with the fear of having made an error, may dissuade some individuals from promoting their work. To some extent, this falls under the realm of peer review, and a potential solution to this concern is to include a separate review process specific to the analytic code, as has been proposed.10
There are equally important reasons for embracing open-source principles. Transparency begets reproducibility and allows subsequent methodologic advancement. Cross-collaboration is inherent in science, and allowing our work to flow unfettered across institutions can propel the field. One such example, the Montreal Neurological Institute and Hospital, which recently became the first open-science institute in the world, foresees accelerated innovation, participation, and implementation of clinical research by removing existing data barriers.21
A variety of mechanisms exist for sharing our analytic code in an open-source epidemiology framework, and doing so has the potential to advance our field. This commentary has focused on the educational aspects and incremental methodological improvements, but there may be other equally important reasons such as transparency and reproducibility. For those who decide to release analytic code, its by-products, or data, there are several considerations including protecting intellectual property, hosting, and disclosure. Working with epidemiologists’ respective institutions can ensure protection for all parties.
I thank Raymond Wildman, US Army Research Laboratory, for helping to conceptualize this article.
ABOUT THE AUTHOR
NEAL D. GOLDSTEIN is an epidemiologist at Christiana Care Health System (Newark, Delaware) and holds a faculty appointment in the Dornsife School of Public Health at Drexel University (Philadelphia, PA). He has extensive experience in epidemiological analyses from secondary data sources, particularly electronic health records. His research spans several disciplines, including vaccine-preventable diseases, sexual minority health and HIV, pediatric infectious diseases, and women’s health surrounding pregnancy. He writes a science blog, which is available at www.goldsteinepi.com/blog.