Ask the Expert: Choosing the Right IT System and Data Infrastructure

Alexander Sherman, director of systems in the Department of Neurology and director of strategic development and systems at the Neurological Clinical Research Institute at Massachusetts General Hospital, discusses why choosing and setting up the right infrastructure for collecting, handling, and sharing data is important for establishing successful research collaborations. He emphasizes the need to think early and clearly about standards and nomenclature for collecting and labeling samples and the resulting data, so that they can be linked and analyzed even when they were collected at different locations and at different time points.

Q: Please tell me about your collaborative research network focusing on rare diseases.

A: My interests lie in rare diseases, and in order to find a cure for such diseases, the research community has to collaborate. Hence, I am interested in learning how technology, processes, and know-how can help move research forward. In order to get the pharmaceutical industry interested in such diseases, we need to create networks of academic institutions and hospitals to run clinical trials with appropriate patients, looking at the right outcome measures. The Amyotrophic Lateral Sclerosis (ALS) consortium started with several hospitals in New England and now it has become a global network with research sites in Europe and Canada. We have a scientific committee to help us agree on what to do, which new drugs and approaches to look into, and which companies to work with. Our focus is on clinical—not basic—research, so it is important for people in the lab to align themselves with the right team in the clinic that will eventually move their research forward.

Q: How important is it for the data infrastructure in clinical research to be set up in a way that fosters collaboration?

A: We have to think first about all the different types of data we can possibly collect from patients. It could be various types of clinical data, clinical images, or collections of biofluids and tissues. We have to think about these heterogeneous data sets and see what needs to be done to link them all together. To this end, we have introduced the concept of virtual biorepositories. As you know, specimens from patients with rare disorders are very valuable. In an academic setting, people compete for the same resources and the same glory, and they don’t like to share. So we changed the paradigm and suggested to the ALS community that they don’t have to share anything; they simply register their biospecimens with the network. The technology we developed allows researchers within the network to search for a particular biospecimen, trace it to the specific individual or institution that holds it, and then approach them directly. This is a way for the research community to collaborate without any obligation to share. However, in order to analyze the collected data and biofluids, we have to agree upon and enforce certain standards for collecting, storing, and labeling biosamples to guarantee their quality. You have to think ahead about whether data can be shared, and which data, and agree a priori on the common data elements (CDEs) to capture and the biosample nomenclature to use. We also have to agree on what to put on a vial label, as it not only identifies your sample but also links it back to the data. Careful planning is a must.
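
To make the register-and-search idea concrete, here is a minimal Python sketch of a virtual biorepository. It is an illustration under stated assumptions, not the network's actual software: the class names, field names, specimen identifiers, and contact addresses are all hypothetical. Only metadata is registered; the physical samples stay with their owners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecimenRecord:
    """Metadata only; the physical sample never leaves the owning site."""
    specimen_id: str      # hypothetical label format
    specimen_type: str    # e.g. "plasma" or "CSF"
    diagnosis: str
    institution: str
    contact: str          # whom to approach about the sample


class VirtualBiorepository:
    """Central index of specimen metadata held at many sites."""

    def __init__(self):
        self._records = []

    def register(self, record):
        self._records.append(record)

    def search(self, **criteria):
        # Return every record whose fields equal all of the given criteria.
        return [
            record for record in self._records
            if all(getattr(record, name) == value
                   for name, value in criteria.items())
        ]


registry = VirtualBiorepository()
registry.register(SpecimenRecord(
    "SITE-A-0001", "plasma", "ALS", "Site A", "lab-a@example.org"))
registry.register(SpecimenRecord(
    "SITE-B-0042", "CSF", "ALS", "Site B", "lab-b@example.org"))

hits = registry.search(specimen_type="CSF", diagnosis="ALS")
for hit in hits:
    print(hit.specimen_id, hit.institution, hit.contact)
```

The search returns who holds a matching specimen and how to reach them; any actual exchange of material is negotiated directly between the researchers.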

Q: Why are the CDEs and biospecimen nomenclature so important?

A: If every lab collects data in different ways, you will not be able to harmonize your data and perform meta-analyses on the aggregated data. So it’s important to agree a priori on the data collection and nomenclature. Currently, the National Institutes of Health (NIH) is attempting to develop disease-specific CDEs to provide a framework for collaboration. Some data elements are not disease-specific, but a majority of data elements, such as functional scales and questionnaires, are specific to certain diseases or disease areas and have to be measured and collected in a certain manner for the data to be shared later. This is not ideal, but it is a first step. While trying to gather data for the ALS network, we found that one of the untapped resources is data from research projects and clinical trials held by various pharmaceutical companies. With a nonprofit organization, Prize4Life, and generous support from the ALS Therapy Alliance, we started a project to create the Pooled Resource Open Access ALS Clinical Trials (PRO-ACT) database, which allows the merging of data from Phase II and Phase III trials from public and private sources. But we faced some challenges, as the data had been collected over a period of 20 years, by multiple companies and institutions, with no standard for data collection in place. Some of the data sets arrived without data dictionaries. In some cases, we had to go into the disease histories and do some reverse engineering to figure out basic patient information. To harmonize this data, we have designed and developed a platform that allows us to build a common data structure (CDS), map individual data sets to the CDS, and then import the data according to the maps.
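
The build-map-import workflow can be sketched in a few lines of Python. This is only an illustration of the general technique, not the PRO-ACT platform itself: the CDS field names, the source data set names, and the column names below are all invented for the example.

```python
# Hypothetical common data structure (CDS): the agreed target field names.
CDS_FIELDS = ("subject_id", "age_years", "sex")

# One map per source data set: source column name -> CDS field name.
# All source names and column names here are made up for illustration.
FIELD_MAPS = {
    "trial_a": {"PatID": "subject_id", "Age": "age_years", "Gender": "sex"},
    "trial_b": {"subj": "subject_id", "age_at_entry": "age_years",
                "sex_cd": "sex"},
}


def to_cds(source_name, row):
    """Import one source row into the CDS according to that source's map."""
    mapping = FIELD_MAPS[source_name]
    record = dict.fromkeys(CDS_FIELDS)  # CDS fields with no data stay None
    for source_field, value in row.items():
        cds_field = mapping.get(source_field)
        if cds_field is not None:       # unmapped source columns are skipped
            record[cds_field] = value
    record["source"] = source_name      # keep provenance for later audits
    return record


pooled = [
    to_cds("trial_a", {"PatID": "A-17", "Age": 54, "Gender": "F"}),
    to_cds("trial_b", {"subj": "B-03", "age_at_entry": 61, "site": "X"}),
]
for record in pooled:
    print(record)
```

Keeping the maps as data rather than code means a new data set only needs a new map, and a missing data dictionary shows up as fields that stay empty rather than as silently misplaced values.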

Q: Do the systems and tools used for collecting the data need to be standardized?

A: It is sometimes important to know which analytical tool was used to make certain measurements, as results may vary. It is less important to know what tool was used to collect and store the data, as long as one can identify and harmonize the data from the various databases. There have to be standard operating procedures (SOPs) in place on how to collect, enter, and clean the data. The data capture system has to be smart enough to check values at the point of entry and prevent inaccurate data from being entered. You have to think about data quality before starting the collection effort. Ideally, an independent data management team will plan ranges for data fields, create queries, and work with individual researchers. Just collecting data does not guarantee its quality.
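
As a rough illustration of planned ranges being checked at the point of entry, consider the following Python sketch. The field names and limits are hypothetical placeholders; in practice the data management team would define them per study.

```python
# Hypothetical planned ranges, as a data management team might define them.
FIELD_RANGES = {
    "age_years": (18, 100),
    "weight_kg": (30, 250),
}


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry is accepted."""
    problems = []
    for field, (low, high) in FIELD_RANGES.items():
        value = entry.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")
    return problems


print(validate_entry({"age_years": 54, "weight_kg": 71}))
print(validate_entry({"age_years": 540}))
```

An entry that comes back with problems would be returned to the person entering it, or raised as a query, before it ever reaches the database.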

Q: What about long-term data storage and protocols for disaster prevention and recovery?

A: It’s important to back up all your data and know what to do with the backups. Should they be kept in the same location as the original data? If so, you are protected against theft or system damage but not against a natural disaster like an earthquake or a fire. Should the backup copy then be sent to a faraway location or a safe deposit box? With newer approaches like storing data in the cloud and in virtual storage sites, you don’t have to worry about data synchronization. The concern, however, is around data security and access. Data backup is also important for auditing purposes, and one should not forget to periodically restore the backups to verify that the data is being backed up correctly. You have to think about all possibilities and procedures. Finally, people who collect, capture, back up, and monitor data all need to be properly trained, and their training records need to be maintained.
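
One simple way to test a restored backup is to compare its checksum with that of the original file. The Python sketch below assumes file-level backups and uses made-up file names and contents; real systems would run a check like this against an actual restore from the backup medium.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original, restored):
    """True when the restored copy is byte-for-byte identical to the original."""
    return sha256_of(original) == sha256_of(restored)


# Demonstration with throwaway files standing in for a real restore test.
with tempfile.TemporaryDirectory() as workdir:
    original = Path(workdir) / "visits.csv"
    good_copy = Path(workdir) / "restored_ok.csv"
    bad_copy = Path(workdir) / "restored_bad.csv"
    original.write_bytes(b"subject_id,visit\nA-17,1\n")
    good_copy.write_bytes(original.read_bytes())
    bad_copy.write_bytes(b"subject_id,visit\nA-17,\n")  # simulated corruption
    good_result = restore_matches(original, good_copy)
    bad_result = restore_matches(original, bad_copy)

print(good_result, bad_result)
```

A mismatch shows that the backup and the original have diverged; it does not say which of the two is wrong, so the follow-up investigation still needs a documented procedure.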

The image illustrates the concept of the Virtual BioRepository, in which databases and biospecimen storage units are decentralized while remaining searchable by the research community. (Source: Sherman et al. Amyotroph Lateral Scler. 2011)