Rachel Murkett, PhD
Shared knowledge drives advances in science and technology. The most common way to share scientific knowledge is through academic publication, which provides a level of quality control through peer review; however, academic publishing is also a long, labor-intensive, and frustrating process for researchers, and one that is prone to poor reproducibility. With the rise of information technology and data science, researchers are now placing a premium on gaining access to the core of scientific knowledge: the data.
What happens when data is openly accessible?
One of the earliest examples of an open-access data repository is the Protein Data Bank (PDB), established in 1971. The PDB now contains digital crystal structure data for more than 150,000 proteins in a standardized, interoperable, and reusable format. The availability of this data has revolutionized all corners of protein science, from drug discovery to education. In pharmaceuticals, PDB data has played a crucial role in the development of new drugs through structure-based drug design (SBDD), whereby drug structures are computationally tailored to the 3D protein structure of their target. This method was instrumental in the design of antiretroviral protease inhibitors, such as saquinavir, which have been associated with a major drop in AIDS-related deaths in the US since their approval, from 50,000 in 1995 to 18,000 in 1998. Saquinavir is only one example of the impact that scientific data can have on real-world situations when it is distributed in a way that is amenable to downstream usage.
FAIR Data Principles
The FAIR Data Principles represent a consensus guide on good data management from all key stakeholders in scientific research. Published in 2016, the guidelines provide key requirements to make scientific data FAIR—findable, accessible, interoperable and reusable. The aim of the guidelines is to emulate the effects of repositories like the PDB, which already operate by these principles, in other areas of science.
While the guidelines provide no specific methods of implementation, they do state the requirements for data FAIRness. To be findable, the data and metadata must each be assigned a globally unique and persistent identifier, and both must be indexed in a searchable resource. To be accessible, the data and metadata must be retrievable by their identifier using a protocol that is open, free, and universally implementable. To be interoperable, the data must use a formal, accessible, shared, and broadly applicable language for knowledge representation, and include qualified references to other relevant metadata. To be reusable, the data must be released with a clear and accessible data usage license, and the metadata must be richly described.
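As a rough illustration of how these four requirements could be turned into an automated checklist, the Python sketch below tests a metadata record against each one. Everything in it is an assumption made for illustration: the `DatasetRecord` fields, the `OPEN_PROTOCOLS` set, and the 20-word threshold standing in for "richly described" are hypothetical and are not part of the FAIR guidelines, which deliberately leave implementation open.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical set of protocols treated as open, free, and universally implementable.
OPEN_PROTOCOLS = {"http", "https", "ftp"}

# Arbitrary stand-in for "richly described" metadata (an assumption, not a FAIR rule).
MIN_DESCRIPTION_WORDS = 20


@dataclass
class DatasetRecord:
    """Hypothetical metadata record for one deposited dataset."""
    identifier: str = ""                                   # e.g. a DOI
    indexed_in: List[str] = field(default_factory=list)    # searchable resources
    access_protocol: str = ""                              # e.g. "https"
    vocabulary: str = ""                                   # shared knowledge-representation language
    related_identifiers: List[str] = field(default_factory=list)  # qualified references
    license: str = ""                                      # data usage license
    description: str = ""                                  # free-text metadata


def fair_report(record: DatasetRecord) -> Dict[str, bool]:
    """Return a pass/fail flag for each of the four FAIR requirements."""
    return {
        # Findable: has a persistent identifier and is indexed somewhere searchable.
        "findable": bool(record.identifier) and bool(record.indexed_in),
        # Accessible: retrievable through an open protocol.
        "accessible": record.access_protocol.lower() in OPEN_PROTOCOLS,
        # Interoperable: uses a shared vocabulary and links to related metadata.
        "interoperable": bool(record.vocabulary) and bool(record.related_identifiers),
        # Reusable: carries a license and a reasonably detailed description.
        "reusable": bool(record.license)
        and len(record.description.split()) >= MIN_DESCRIPTION_WORDS,
    }


if __name__ == "__main__":
    example = DatasetRecord(
        identifier="doi:10.0000/example",   # placeholder identifier
        indexed_in=["example-index"],
        access_protocol="HTTPS",
        license="CC-BY-4.0",
        description="Short description.",
    )
    print(fair_report(example))
```

A report like this only flags missing fields; judging whether an identifier is truly persistent or a license is appropriate would still need human review.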
Maximizing the FAIRness of clinical data
To achieve FAIRness, a level of centralization is required. The objectives of making data easily findable and accessible are defeated if one needs to search hundreds of repositories for it. Furthermore, centralization can improve the interoperability and reusability of data by applying one consistent set of requirements to all datasets. Yet the world’s largest funder of clinical research, the National Institutes of Health (NIH), currently lists 82 approved repositories for deposition of data. These vary from general-purpose repositories, such as clinicaltrials.gov, to specialized repositories, such as ITN TrialShare or NIDA Data Share. This lack of centralization and standardization is a key area that warrants improvement to bring clinical data in line with the FAIR principles.
The largest open-access database for clinical trial data is clinicaltrials.gov, which currently contains more than 300,000 studies from 208 countries. However, 88 percent of these studies list only general data regarding the trial, such as the planned intervention, number of patients enrolled, and sponsors. They lack data on trial outcomes, which limits the usefulness of the repository. A federal mandate issued in 2016 sought to change this by requiring all NIH-funded trials to deposit summary results information. The cost of this mandate was projected to reach USD 59.6 million, driven primarily by the additional time needed to organize, format, and submit the results, a burden that was expected to fall on the clinical researchers responsible for the data.
In 2010, the world’s largest clinical research funders and partners became the 19 co-signatories on a joint statement on sharing research data to improve public health. One of the key commitments made by the stakeholders was to maximize the ability of researchers to share their data without incurring significant cost.
To unlock the value of clinical data, there is an unmet need for increased centralization of clinical data repositories and new tools that reduce the burden of FAIR data on researchers.
The future of clinical data management
A unique aspect of the FAIR data guidelines is their recognition of machines as key stakeholders in the future of data science. The guidelines foresee a future where machines could autonomously perform analytics without requiring manual input from humans. Could machines also hold the answer to reducing the burden of FAIR data on researchers?
A key milestone toward reaching this potential has been the transition to Electronic Health Records (EHRs). Over the past 10 years, the transition to EHRs has delivered advances in clinical research by decreasing the costs and timelines of data collection. Looking ahead, EHRs open doors for automating data collection, organization, cleaning, and even sharing, which could further reduce the time burden on researchers. However, major challenges remain in the quality, security, and interoperability of the data entered in EHRs. The formats for collecting and storing EHRs often differ across institutions, so additional steps are required to integrate the data and to translate it from natural language into the standardized formats required by clinical data repositories.
Two key areas that will increase the FAIRness of clinical data, while reducing the administrative burden, are improvements to the interoperability of EHRs and centralization of clinical data.
Improving the interoperability of EHRs
The Strategic Health IT Advanced Research Projects (SHARP) program was established in 2010 with the goal of improving the interoperability of data from EHRs, thus supporting exchange, sharing, and reuse of operational clinical data. The prototype platform can receive EHR data in several institutional formats, generate structured information from narrative text, and normalize EHR data to common clinical models and terminologies. These capabilities could be highly valuable in preparing data for the standardized format of a repository and could pave the way to automating time-consuming stages of data management.
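To make the idea of normalizing EHR data to a common model more concrete, here is a minimal Python sketch. It is not the SHARP platform's implementation: the two site formats, the field names, and the small `TERMINOLOGY` table are all hypothetical, and the sketch covers only the normalization step, not the extraction of structured information from narrative text.

```python
from datetime import datetime
from typing import Dict

# Hypothetical mapping from local codes and phrases to one shared term.
TERMINOLOGY = {
    "htn": "hypertension",
    "high blood pressure": "hypertension",
    "t2dm": "type 2 diabetes mellitus",
}


def to_common_term(text: str) -> str:
    """Map a local diagnosis label to the shared term, or keep it if unknown."""
    return TERMINOLOGY.get(text.strip().lower(), text)


def normalize_site_a(rec: Dict) -> Dict[str, str]:
    """Site A (hypothetical) uses flat records with US-style dates."""
    return {
        "patient_id": rec["pid"],
        "birth_date": datetime.strptime(rec["dob"], "%m/%d/%Y").date().isoformat(),
        "diagnosis": to_common_term(rec["dx"]),
    }


def normalize_site_b(rec: Dict) -> Dict[str, str]:
    """Site B (hypothetical) uses nested records with ISO dates."""
    return {
        "patient_id": rec["patient"]["id"],
        "birth_date": rec["patient"]["birthDate"],
        "diagnosis": to_common_term(rec["diagnosis_text"]),
    }


if __name__ == "__main__":
    site_a_record = {"pid": "A-17", "dob": "03/21/1980", "dx": "HTN"}
    site_b_record = {
        "patient": {"id": "B-42", "birthDate": "1975-11-02"},
        "diagnosis_text": "High blood pressure",
    }
    # Both records come out with the same keys and the same diagnosis term.
    print(normalize_site_a(site_a_record))
    print(normalize_site_b(site_b_record))
```

Once every institution's records share one shape and one vocabulary, a single submission routine could in principle serve all of them, which is where the time savings for researchers would come from.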
Another potential solution to the issue of data sharing costs is presented by Clinical Data Research Networks (CDRNs), which were founded as part of the National Patient-Centered Clinical Research Network. The goal of the network is to unite patient groups, researchers, and healthcare systems to support rapid and cost-effective clinical research. While the project has a broad scope, covering interventional and observational population-based research, a primary focus of the CDRNs is to create interoperable databases from EHRs collected across multiple institutions. The infrastructures developed to achieve interoperability and standardization of EHRs and patient-level clinical data could be highly relevant for reducing the data management burden on researchers.
Centralizing clinical data
Vivli is an independent, non-profit repository launched in 2018 that aims to improve the FAIRness of clinical data. To date, it holds the data for more than 3,600 clinical trials. The platform functions as a broker of de-identified patient-level data, submitted by its 17 members, to researchers within the scientific community. Like the CSDR platform, Vivli provides a level of security through the application process required to access the data. Beyond the services of other repositories, Vivli also provides a secure research environment containing tools to perform analysis and integrate data from other platforms. The capability to import, integrate, and manipulate data within the repository represents a significant step in the centralization and interoperability of clinical trial data sharing.
The potential impact of big data and digital platforms on healthcare is immense. From improving clinical data collection to supporting future trial design to facilitating new discoveries, there is a precedent for significant leaps across many medical fields. This potential, however, rests on the availability of high-quality data that is findable, accessible, interoperable and reusable, and the tools that will streamline clinical data management. To use the words of Atul Butte, chief data scientist at the University of California Health System, “Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world.”