In 2004, Fleiszer and colleagues1 discussed the development and use of digital repositories for learning objects that could be referenced in online medical curricula. In this article, we consider the implementation of a global digital repository for medical education research data sets—an online site where medical education researchers would be encouraged to deposit their data in order to facilitate the reuse and reanalysis of the data by other researchers.
Key Concepts and Definitions
A repository, in this general framework, is “a networked system that provides services pertaining to a collection of digital objects.”2 The digital object is “a data structure whose principal components are digital data and key-metadata.” Digital data refers to any informative content; key-metadata (more commonly, simply “metadata”) provide information about the content or its representation, such as a unique identifier and details about how the data are formatted.
Research data repositories organize and facilitate access to research data. Figure 1 provides a schematic illustration of a data repository. Elements of raw data (E1, E2, etc.) are logically grouped together into data sets (S1, S2, etc.), each of which has associated metadata (M1, M2, etc.) describing the organization of the data, the process by which the data were collected, and other characteristics. Data sets may be logically organized into data set collections, possibly with hierarchical structure (e.g., a medical education collection within a health professions education collection within a postsecondary education collection). The repository consists of the sum of its collections, along with associated mechanisms for search and retrieval (not pictured).
An important component of most repositories is the assignment of a unique, permanent identifier to each data set in the repository, much as an ISBN is a unique, permanent identifier for a book. The identifier provides a standard for academic citation of data sets, which facilitates publishing new analyses of data sets.
Unit of retrieval
The search and retrieval process defines a unit of retrieval, a key feature that differs among repository systems. Most research data repositories treat the data set as the unit and allow users to search for and retrieve complete data sets. This permits diversity among the data sets in their internal organization, variable naming conventions, and file format. The repository does not need to “look into” the data sets; indexing and searching operate only on the associated metadata. Other data repositories treat the individual data element as the retrieval unit and allow query and retrieval of subsets of data from a data set or even extraction of data elements across multiple data sets. These approaches are more common when all of the data sets are produced by a single ongoing data-collection process for which procedures have been standardized (e.g., the U.S. Census) or when conventions of the field specify a standard file format.
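The distinction between the two units of retrieval can be sketched in a few lines of Python. This is a minimal illustration, not the design of any existing repository system: the class names, the identifier format, and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    identifier: str            # unique, permanent identifier (hypothetical format)
    metadata: dict[str, str]   # descriptive metadata (M1, M2, ...)
    elements: list[dict] = field(default_factory=list)  # raw data elements (E1, E2, ...)


@dataclass
class Repository:
    data_sets: list[DataSet] = field(default_factory=list)

    def search_data_sets(self, term: str) -> list[DataSet]:
        """Data set as unit of retrieval: match metadata only, return whole data sets."""
        term = term.lower()
        return [
            ds for ds in self.data_sets
            if any(term in value.lower() for value in ds.metadata.values())
        ]

    def query_elements(self, variable: str) -> list[tuple[str, object]]:
        """Data element as unit of retrieval: pull one variable across data sets.

        This relies on shared variable names, which is why the approach
        suits collections with standardized procedures or file formats.
        """
        return [
            (ds.identifier, row[variable])
            for ds in self.data_sets
            for row in ds.elements
            if variable in row
        ]


# Hypothetical sample content, for illustration only.
repo = Repository([
    DataSet("example:0001",
            {"title": "Clerkship exam scores", "design": "cohort"},
            [{"learner_id": 1, "score": 82}, {"learner_id": 2, "score": 75}]),
    DataSet("example:0002",
            {"title": "Resident duty hours survey", "design": "survey"},
            [{"learner_id": 7, "hours": 61}]),
])

whole_sets = repo.search_data_sets("survey")   # never inspects the elements
scores = repo.query_elements("score")          # looks inside every data set
```

In the first call the repository never opens the data themselves, so depositors remain free to organize their files as they wish; the second call is possible only because the data sets happen to share a variable name.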
Repositories are commonly described as national repositories, institutional repositories, or domain repositories.
National repositories collect data from a particular country. Most national repositories archive either governmental publications or national data sets and statistics. Fundamentally, they represent those data that have been officially sanctioned as being products of the government.
Institutional repositories are maintained by an institution in order to preserve and distribute data produced by researchers at the institution.3,4 In 2005, Lynch and Lippincott5 reported that approximately 40% of research universities maintained some type of institutional repository. Most of these, however, were archives of publications (theses, e-prints, conference proceedings, working papers, and others), and relatively few included data sets.
Domain repositories archive and maintain data sets related to a particular academic domain, field, or specialty. These data sets may be eclectic, as is common in research data repositories, or may include predefined variables in a standard format, as is common in quality assurance data repositories (e.g., the National Database of Nursing Quality Indicators of the American Nurses Association). This facilitates collecting domain-specific data from multiple institutions, but it requires an ongoing commitment from some particular institution or consortium to maintain the archive.
Examples of Repositories
We sought out extant examples of data repositories in medicine, education, and social science that could illustrate the potentials and pitfalls of the development of a repository for medical education research. One of us (C.P.) conducted Internet searches using Google Scholar, PubMed, and CINAHL (contact us for details of the search strategy). Citations were reviewed to identify common examples of research data repositories, excluding those that were primarily bibliographic in nature or that collected human tissues, educational images, or particular quality indicators rather than research data sets.
Repositories in clinical medicine have often focused on providing institutional, local, or national data sets and statistics to support health care utilization research. Three notable U.S. examples are:
- Health Data Tools and Statistics,6 sponsored by Partners in Information Access for the Public Health Workforce (a collaboration of U.S. government agencies, public health organizations, and health sciences libraries), compiles links to data sets and repositories in medicine.
- H•CUPnet,7 published by the U.S. Agency for Healthcare Research and Quality, provides commercial access to raw hospital utilization data from a large set of statewide data-collection projects, as well as a free, online query system for retrieving summary statistics from the data.
- Health Services and Sciences Research Resources (HSRR),8 published by the U.S. National Library of Medicine, provides links to, and metadata about, clinical, epidemiological, and health services data sets, instruments, and software.
Each of these repositories takes a different approach to providing access to data. Whereas Health Data Tools and Statistics provides links to data sets, and H•CUPnet provides a query interface for generating summary statistics, HSRR is notable in that it provides not only links to data sets but also rich metadata. HSRR metadata include information on the data source, its purpose, a description of the data, restrictions on use of the data, data-collection details (interval, population demographics, geographic region, unit of analysis), a PubMed search link that may identify publications based on the data set, contact information, and a revision date.
Education and social science
Several well-known data repositories focus on social science and education.
- The Inter-University Consortium for Political and Social Research (ICPSR)9 is among the oldest data repositories, founded in 1962. In addition to providing direct access to data sets with substantial metadata, it contains valuable user support documentation for the preparation, archiving, and use of social science data.10 ICPSR currently archives a small number of data sets relevant to medical education: the Integrated Postsecondary Education Data System, which includes data in categories entitled “Salaries, Tenure, and Fringe Benefits of Full-Time Faculty” and “Higher Education Finance.”
- The Roper Center for Public Opinion Research11 specializes in archiving data sets related to public opinion, including economic issues and policy, education, elections, health issues, international news, lifestyles, polls, and polling. Data sets are available at no charge to member institutions; others can access data sets at a cost. Metadata are provided.
- The H. W. Odum Institute for Research in Social Science12 maintains vital statistics, public opinion, and social psychological data. The Odum Institute also permits individual researchers to deposit data sets in their archives.
- The Henry A. Murray Research Archive13 hosts quantitative and qualitative social science research data and permits individual researchers to deposit data sets. Its archives use the Dataverse Network (DVN) platform (discussed below). The Murray Research Archive currently contains five medical education data sets from the late 1970s and early 1980s: studies of career patterns of Harvard and Tufts medical school trainees and the Practice and Life Patterns of Women and Men Physicians study. Access to most data sets requires an application that includes a one- to two-page description of the research project.
Arguments for a Medical Education Data Repository
Although excellent social science data repositories exist, there is no global research data repository in medical education specifically. Here, we review some reasons why such a repository would be of value and some data on perceived need for a data repository among directors of medical education research.
Data set repositories can provide permanent access to unique data sets along with metadata to enable researchers to reuse the data. Medical education research could particularly benefit from a data repository for two related reasons.
First, many educational studies conducted in medical schools and hospitals are hard to replicate. Researchers in medical education will note the difficulty of conducting any sort of large-scale educational research amid the tightly scheduled medical school curriculum, among interns and residents kept continually busy with patient-care responsibilities, or in the context of hospitals and medical centers facing regulatory and financial pressures. Accordingly, when an educational study is conducted, it is imperative that the data obtained be leveraged to maximize their scientific value. In practice, however, after publication of a manuscript describing the study and its results, study data often languish in the investigator's file drawer or computer. A data repository could facilitate the secondary analysis of these data to answer new scientific questions or provide the basis for power analyses for future studies.
Second, medical education research studies are often inherently limited in their sample sizes. Single-site studies are limited by the numbers of medical students, residents, or physicians available at the site; multisite studies are limited by the number and size of sites and the study budget. Although moderate sample sizes may well be sufficient for answering investigators' primary research questions, they may limit generalizability and the ability to study or generate secondary or mediational hypotheses. Meta-analysis and meta-regression offer promising approaches to estimating more precisely educational effects and their generalizability across contexts by combining the data from multiple studies. Reliable meta-analysis, however, depends on the availability of data sets from multiple studies testing the same hypotheses.14 A data repository for medical education could lead naturally to meta-analytic review of educational questions commonly studied by multiple investigators, such as the impact of different types of curricula on student performance or the characteristics of new assessment approaches.
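To illustrate why combining studies sharpens estimates, the following sketch pools effect sizes with inverse-variance (fixed-effect) weighting, one common meta-analytic approach. The effect sizes and standard errors are invented for illustration and do not come from any actual study; a real analysis would also have to consider heterogeneity across sites and might call for a random-effects model instead.

```python
import math


def pool_fixed_effect(effects, standard_errors):
    """Inverse-variance weighted mean of study effects and its standard error."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se


# Hypothetical standardized effects from three small single-site studies.
effects = [0.30, 0.50, 0.40]
standard_errors = [0.20, 0.25, 0.10]

pooled, pooled_se = pool_fixed_effect(effects, standard_errors)
print(f"pooled effect = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

The pooled standard error is smaller than that of any single study, which is the gain in precision that access to multiple data sets makes possible.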
An assessment of perceived need
To further assess the perceived need for, and potential acceptance of, a data repository for medical education research, we conducted an online survey of the 88 members of the Society of Directors of Research in Medical Education (SDRME). Forty-one SDRME members completed the online survey, including eight members from institutions outside the United States (47% RR1 response rate,15 above average for online surveys16–18). The study was determined to be exempt from human subjects review by the institutional review board of the University of Illinois at Chicago.
Participants responded to 17 questions about data sharing and a potential data repository on a six-point Likert-type scale. Figure 2 depicts the mean responses to the survey items, with 95% confidence intervals; responses were generally normally distributed about their means. Overall, respondents strongly endorsed data sharing, with the caveat that principal investigators should choose whether or not to share data they collect. This may represent an acknowledgment of the “trustee” role of researchers endorsed by participants; respondents may feel that principal investigators in their units are best suited to judge whether, when, and how to share data. In addition, the large majority believed that a repository would benefit their unit and the field of medical education. Few respondents reported using existing repositories.
Challenges in Building a Data Repository in Medical Education
Building a domain repository for medical education data sets entails a set of related challenges. These include the development of taxonomic standards, clarification of intellectual property and human subjects protection issues, identification of a suitable technological infrastructure, and establishment of criteria for evaluation of the repository.
Metadata standards, such as Dublin Core and the Data Documentation Initiative,19 provide a structured language with which to describe data sets at the level of the study (how the data were collected, by and from whom, when, and other information), at the level of the variables in the data set (variable name, format, description), and at the level of the physical arrangement of the data set (such as files, file format, and file size). Metadata standards provide a common framework, but to be truly useful, a medical education research repository will also require taxonomic work focused on medical education studies.1
Specifically, it will be necessary to define a set of controlled vocabularies related to medical education research to facilitate cataloging, searching, and combining data sets. These would include some general research vocabularies related to the study design and nature of data (quantitative, qualitative) as well as field-specific vocabularies for educational setting, learner level, and other aspects of medical education research studies.
The development of a useful controlled vocabulary is not trivial.20 For example, a vocabulary related to learner level (such as that of the undergraduate medical student, intern, resident, or fellow) when learners are the study subjects would clearly be important, and amply illustrates some of the complications of controlled vocabularies. Learner-level terms are hierarchically organized; for example, “M1,” “M2,” “M3,” and “M4” might be subgroups of “undergraduate medical student.” Moreover, a common set of terms must be sufficiently flexible and complete to account for differences in medical education systems internationally (e.g., four-year versus six-year medical schools, resident versus house officer versus fellow versus registrar) and for different implications of postgraduate year for different specialties (e.g., postgraduate year 4 in pediatrics versus postgraduate year 4 in surgery).
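A hierarchical vocabulary of this kind might be represented as follows. This is only a sketch: the terms are taken from the examples above, but the parent-child structure, the synonym table, and the function names are assumptions made for illustration, not a proposed standard.

```python
# Each preferred term maps to its broader (parent) term; top-level terms map to None.
LEARNER_LEVEL = {
    "undergraduate medical student": None,
    "M1": "undergraduate medical student",
    "M2": "undergraduate medical student",
    "M3": "undergraduate medical student",
    "M4": "undergraduate medical student",
    "intern": None,
    "resident": None,
    "fellow": None,
}

# Hypothetical synonyms (lowercase) mapped onto preferred terms.
SYNONYMS = {
    "first-year medical student": "M1",
    "medical student": "undergraduate medical student",
}


def normalize(term):
    """Return the preferred term, or raise ValueError if the term is unknown."""
    lowered = term.strip().lower()
    for preferred in LEARNER_LEVEL:
        if preferred.lower() == lowered:
            return preferred
    if lowered in SYNONYMS:
        return SYNONYMS[lowered]
    raise ValueError(f"not in controlled vocabulary: {term!r}")


def broader_terms(term):
    """List every ancestor of a term, nearest first."""
    chain = []
    parent = LEARNER_LEVEL[normalize(term)]
    while parent is not None:
        chain.append(parent)
        parent = LEARNER_LEVEL[parent]
    return chain


def matches(query, tag):
    """True if a data set tagged with `tag` should be found by a search for `query`."""
    q, t = normalize(query), normalize(tag)
    return t == q or q in broader_terms(t)
```

With such a structure, a search for "undergraduate medical student" would also retrieve data sets tagged "M3", whereas a search for "M1" would not. Accommodating international differences would require many more terms and mappings, each of which is a substantive decision rather than a technical one.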
Work on vocabularies for metadata in medical education is already under way at the Medbiquitous Consortium (http://www.medbiquitous.org), which defines open standards and specifications for information technology in medical education. For example, the Medbiquitous Healthcare Learning Object Metadata specification includes a “context” element with a vocabulary that includes “school,” “higher education,” “training,” “patient education,” “caregiver education,” “undergraduate professional education,” “graduate professional education,” “continuing professional development,” and similar terms.21 Although Medbiquitous focuses on the practice of education, rather than research in education, its standards may offer useful contributions to research repository controlled vocabularies.
Research databases, by virtue of the creativity involved in the selection of their contents, are protected compilations under international copyright law. Their ownership may vest in the study investigators, in their university or employer, or in the research sponsor, depending on the nature of the research and agreements made with funders.22 Each investigator (and therefore any prospective medical education research repository) must ascertain whether the investigator is, in fact, permitted to deposit the data. Such restrictions are a significant concern, as selective data depositing may result in “publication bias” and limit meta-analyses. When ownership issues prevent investigators from depositing data or convincing the owners to do so, investigators should document the existence of the data and their inability to deposit those data (much as clinical investigators now register clinical trials in advance of data collection or publication). This will require widespread practice changes as investigators anticipate the need for data set registration; these changes could be driven in part by the editors of medical education journals.
As an example of intellectual property management, the repository hosted by the Harvard Institute for Quantitative Social Science (IQSS) sets terms for data deposit and download that apply to all collections in the network. Individual collections and data sets can add additional terms or requirements to the network terms.
Data deposit terms typically include (1) representations that the uploader has the legal authority to license the repository to archive and share the data, (2) permission for the repository to do so, (3) representations that the data are not illegal (e.g., infringing or classified) or dangerous (e.g., containing a virus or malware), and (4) representations that the data were collected with ethical approval and do not contain personal identifiers.
Data download terms typically include (1) agreement not to attempt to identify individual subjects in the data, (2) agreement not to download or use data for illegal purposes, (3) the requirement to attribute and cite data sets in publications arising from them, and (4) a disclaimer of warranties or obligations of the repository to the data user.
Because educational research very often involves human subjects, protection for human subjects is a requirement for a medical education data repository. Although human subjects protections vary internationally, generally accepted repository practice is to require that investigators attest that the data were collected with the approval of their local ethics committee or institutional review board and deposit only data which have been completely deidentified (allowing other researchers to download and reanalyze the data without requiring ethics review themselves). Local review boards are best suited to make ethical determinations. In some cases, they may have limitations on data sharing or requirements that research subjects prospectively consent to, or be informed of, data-sharing practices even for deidentified data.
Even when human subjects are not identifiable in archived data, institutions may be. In particular, research data are likely to provide information about the investigator's institution, and data that reflect on institutional performance (e.g., deidentified measures of student achievement or mental health) may be a cause for concern for institutions. When these institutions are data owners and must agree to allow data to be archived, they may be reluctant to do so. The field of medical education will need to consider approaches to balance the benefits of data to researchers with the concerns of the institutions.
A principal challenge in the development of data repositories in the past has been the choice and maintenance of infrastructure, including the hardware, software, and network connections that enable the repository to archive data sets and researchers to retrieve them, along with adequate safeguards for data integrity. Currently, however, several projects have made establishing the infrastructure for a new repository highly feasible at reasonable cost.
Because hardware requirements are driven by software, it behooves the designers of a new repository to consider software early. Two repository software solutions currently command the greatest attention: DSpace and DVN.
DSpace, developed by MIT and Hewlett-Packard, is used by over 360 institutions worldwide to maintain institutional data repositories.23 In 2007, over 46% of U.S. libraries with institutional repositories used DSpace.24
DSpace archives both publications and data sets and supports a rich metadata model based on well-known library metadata standards including Dublin Core.25 DSpace treats data sets as archival units. Metadata include both descriptive information about the archived item and administrative information (such as the item's provenance and authorization). DSpace can also “harvest” entries from other archives that support the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)26 and can present links to those entries as part of the collection.
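As a rough illustration of what harvesting involves, an OAI-PMH harvester requests metadata records from another archive over HTTP using a small set of protocol verbs. The sketch below only builds such a request; the base URL and set name are hypothetical placeholders, while `ListRecords`, `metadataPrefix`, `set`, and the `oai_dc` (Dublin Core) format are defined by the protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a given archive."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optionally restrict the harvest to one set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical archive endpoint and set name, for illustration only.
url = list_records_url(
    "https://repository.example.org/oai/request",
    set_spec="medical_education",
)
print(url)
```

The archive answers with an XML document of metadata records, which the harvesting repository can index and present as links alongside its own holdings; the data sets themselves stay where they were deposited.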
DSpace is free/open source software; there is no cost for the software itself, and its license allows users to modify and redistribute the software. DSpace can run on inexpensive hardware running the Linux operating system. Most institutional DSpace installations, however, involve an outlay of at least $40,000 for server hardware and storage, according to the dspace.org “Frequently asked questions” page.
DVN is a more recent system developed by the IQSS at Harvard University. DVN was developed specifically for archiving research data, and it can subset properly formatted data sets (e.g., SPSS or Stata), allowing researchers to extract individual data elements and to perform online statistical analysis within DVN itself. A DVN is organized into dataverses, virtual collections of data sets that can be organized hierarchically and can harvest metadata from other archives that support OAI-PMH.27
DVN is also free/open source software. Installation requires at minimum two Linux servers, but the IQSS hosts a DVN that permits anyone to create his or her own dataverse without needing any local hardware. The creator is responsible for setting policies and maintaining the content in the dataverse; the IQSS maintains the software and storage.
The Center for Research Libraries28 has published a set of criteria for evaluating digital repositories. The criteria include organization, governance and accountability, content, normalization processes applied to uploaded information, technical systems and data security, cost structure and distribution, rights, results, and outputs. The criteria are derived with repository longevity in mind and seek to isolate those factors that will contribute to reliability and continuity.
The permanence of the sponsoring organization and its commitment to the stability of the repository signal intent to preserve the information therein available. Governance of the organization in a fashion that promotes mutual accountability between repository users and the organization also contributes to repository stability. Clear content-collection guidelines and reliability of continued collection determine usage of the repository. The technical infrastructure used by the repository (including security, conformance to standards, methods of backup, authentication, and scalability) is an important evaluation criterion. The need for a well-thought-out cost structure is self-evident. Documentation of who holds the right to the content and what will happen in the eventuality that the content becomes prohibitively expensive should be transparent. Finally, a proven track record, that is, a digital repository that has demonstrated the aforementioned characteristics historically, predicts future satisfactory performance.28
On the basis of our review of the principles and operation of data repositories, our analysis of the need for data sharing in medical education research, and our discussion of the challenges of data set repositories, we offer the following conclusions and recommendations to medical education researchers and research institutions.
A data repository for medical education research would be beneficial for the field.
The availability of data sets in medical education will permit investigators to advance the field by conducting new analyses and meta-analyses. A repository should document and preserve unique data that are difficult to collect and should do so centrally, providing unfettered access to researchers worldwide.
The data repository should be hosted by the DVN at the IQSS at Harvard University.
The DVN software is ideally suited to host a repository for medical education research data. Designed with data sets in mind (rather than, for example, publications), it uniquely supports subsetting and on-the-fly data analysis, yielding greater efficiency and flexibility in accessing data.
Although a medical education repository could be set up by any institution willing to invest in hardware, the Center for Research Libraries standards strongly argue for leveraging preexisting technical infrastructure and permanent sponsorship. Harvard has already developed sophisticated infrastructure and has made permanent commitments to maintain and preserve the IQSS's DVN. A medical education dataverse at the IQSS would immediately benefit from proven security, backup, authentication, and scalability. Researchers will be able to focus efforts on the contents of the repository rather than its infrastructure.
The repository should be planned with ongoing evaluation in mind.
Although technical stability is essential, successful repositories must also consider the needs and demands of users.29 Accordingly, the development of a global domain repository for medical education data should include the parallel development of plans for ongoing evaluation of the repository's ability to serve its users. Such evaluation should include characterization of the targeted and actual users of the repository, the strengths and weaknesses of the repository's interface, the planned and actual content of the repository, and publications that arise from data sets available in the repository. In short, who uses the repository? How, for what, and with what outcomes? Naturally, the data characterizing the repository's operations should themselves be deposited in the repository.
A first step is the formation of an international development committee of stakeholders in medical education research.
Perhaps the most difficult tasks in the implementation of a medical education research data repository are the development of standards for data set documentation and metadata and assessment of the requirements for ongoing governance or stewardship of the repository. To this end, we recommend that an international development committee be formed, consisting of key stakeholders in medical education research, including representation of researchers, funding agencies, institutions, organizations (e.g., the World Health Organization, the Association of American Medical Colleges, the National Board of Medical Examiners), students, and editors of medical education journals. This committee should engage task forces to develop field-specific controlled vocabularies and metadata standards, consider protections for institutions identifiable in data, and draft policies and practices for the operation and governance of the repository. The development committee also provides an opportunity to explore the needs of stakeholder organizations and to lay the groundwork for successful collaboration.
A second step is the formation of an international managing committee.
We further recommend that formal governance of the repository be vested in an international managing committee, akin to a journal editorial board, containing representation from key stakeholders. The role of the managing committee should be to establish and review policies for operations, to conduct evaluations of the use of the repository, and to serve as a point of contact for users with questions. Decisions about data sets themselves should be decentralized: subject to repository-wide terms of operation that should ensure equitable access, individual depositors should control any additional terms under which they make their data available. A functional governing structure is absolutely essential to the widespread success of a medical education research repository. A useful model might be the Internet Corporation for Assigned Names and Numbers (http://www.icann.org), the nonprofit public benefit corporation that coordinates the Internet naming system and manages the central technological core that supports the Internet's domain name system.
Although the DVN is a free resource, the ongoing management of the repository will involve modest costs associated with the operations of the managing committee, including user support and evaluation processes. Although these costs may be initially met through volunteer labor or grant funding, long-term operating expenses will require an ongoing revenue source to be sustainable. Several support models are possible and should be considered by the committee, including stakeholder membership fees and micropayments (e.g., $1 per download) for data set access with counterbalancing credits for data deposit.
A third step is to provide strong incentives to investigators and institutions to deposit data.
There are clearly incentives for ongoing use of publicly available data to answer research questions, including the dissemination of knowledge that can improve the operation of medical schools and increased opportunities for research publications that will contribute to the careers of medical education researchers. Incentives to deposit data, however, are less tangible, and this may lead to a social dilemma, in which everyone takes advantage of a common resource, but no one contributes to its growth and upkeep.
Encouraging data deposit involves both outreach and attention to suitable rewards for contributors. Outreach efforts should focus on research institutions and individual investigators. Medical education journals may be particularly valuable in calling attention to the benefits that the field will accrue from data sharing. In addition, accrediting bodies and funding agencies can encourage or require institutions to plan for and engage in data sharing. The example of the U.S. National Institutes of Health, which mandates data sharing as a condition of funding for projects exceeding $500,000 in a year—and allows investigators to request costs associated with data preparation and documentation—might well be followed by funders of medical education research.30
Institutions can provide incentives to investigators both by identifying data deposit as a scholarly activity in promotion and tenure decisions and by recognizing that investigators should receive scholarly credit when their deposited data sets are reanalyzed and become the basis of publications by other researchers. Medical education journals can assist in this process by supporting the citation of reanalyzed data sets using the repository's unique identifier.
By motivating data sharing and reuse through a common digital repository for medical education research data sets, investigators, medical schools, and other stakeholders may see substantial benefits to their own endeavors and to the progress of the field of medical education.
The authors wish to thank members of the Society of Directors of Research in Medical Education who participated in the needs assessment survey.
Since the submission of this manuscript, Dr. Schwartz has been involved as a consultant to the American Board of Pediatrics and Association of Pediatric Program Directors in the area of designing a data repository for studying innovations in graduate medical education in pediatrics.
This study was deemed by the institutional review board of the University of Illinois at Chicago to be exempt from review under 45 CFR 46.