Big Data’s Role in Drug Discovery

Just how big is biomedical data? Well, genomic data alone could soon surpass YouTube™ and Twitter™ as the world’s largest data generator.

by Karyn Hede

New Tools and Applications Move Beyond a “One Target, One Drug” Approach

Projecting over the next decade, scientists at the University of Illinois at Urbana-Champaign suggested in a 2015 commentary in the journal PLOS Biology that genomic data could end up swamping all currently available storage, based on the 100 million to 2 billion human genomes expected to be sequenced by 2025. And their estimate didn’t take into account transcriptome, epigenome, or proteome data, let alone the massive storage requirements of medical images.

Making sense of all this data has become big business, with more than 200 data analytics companies vying for the attention of those engaged in biomedical research. Many others are working on open source big data solutions.

Related Article: How Big Data Can Save Lives

Some are even making a star turn, like the Scripps Translational Science Institute, which recently hired Paul DePodesta, the baseball analytics guru immortalized in the movie Moneyball. DePodesta is celebrated for his ability to pull hidden relationships out of baseball statistics that add up to a winning formula. Similarly, the expectation is that advanced analytics can turn up novel relationships in biological data that are impenetrable to ordinary statistical means, such as time-honored logistic regression analysis. This investment in data analytics is expected to usher in the next wave of drug discovery, providing more solid drug candidate leads and helping treat previously intractable conditions such as Parkinson’s disease, Alzheimer’s disease, and other complex neurological disorders.

But just how would this work?

The multi-institutional, multidisciplinary Big Data for Discovery Science (BDDS) Center is on the front lines of this new data-intensive frontier. The research team includes computational scientists accustomed to the massive data-processing needs of Argonne National Laboratory, operated by the University of Chicago for the U.S. Department of Energy, as well as members of the Institute of Neuroimaging and Informatics and the Information Sciences Institute, University of Southern California, and the Institute for Systems Biology, Seattle. “You are dealing with many hundreds of terabytes of data,” says Arthur W. Toga, PhD, director of the Laboratory of Neuro Imaging (LONI), Keck School of Medicine, University of Southern California, and co-principal investigator of BDDS. “Hosting that locally is a serious investment, and you need to be committed to doing that. Only some companies or organizations have the resources to do it. You can use centralized cloud-based services, but if the computational complexity of whatever you are doing is high, there has to be a proximity between the processing resources and the storage resources because you can’t move it around very effectively.… It just doesn’t work.”

The BDDS has more than six petabytes of storage and 6,000 processors to facilitate exploration of its collections, which include neuroimaging, genomic, and microscopy data. The BDDS team is working on a user-friendly system for managing and manipulating big datasets. To meaningfully integrate biomedical data, algorithms have to handle disparate data sources and reconcile them, says Toga, and the validity of the underlying analytic algorithms is critical to any big data analysis.

“You can find relationships in variables that are totally nonsense,” says Toga. “Big data necessitates big responsibility as well. People need to employ appropriate statistical controls for random effects to prevent spurious findings from making you go in the wrong direction.”
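
Toga’s point about statistical controls is easy to demonstrate. The short sketch below (an illustration, not BDDS code) screens 1,000 random variables against a random outcome: roughly 50 of them pass an uncorrected p < 0.05 threshold purely by chance, while a standard multiple-testing correction removes essentially all of them.

```python
# Illustrative sketch (not BDDS code): why uncorrected screens of many
# variables produce spurious "findings", and how a correction helps.
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_samples, n_features = 100, 1000

outcome = rng.normal(size=n_samples)                   # random "phenotype"
features = rng.normal(size=(n_samples, n_features))    # random "measurements"

pvals = np.array(
    [pearsonr(features[:, j], outcome)[1] for j in range(n_features)]
)

print("nominal hits (p < 0.05):", int((pvals < 0.05).sum()))   # ~50 by chance

# Benjamini-Hochberg false discovery rate control
rejected, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("hits after FDR correction:", int(rejected.sum()))       # usually 0
```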

Related Article: Big Data Mining

He also points out that incompletely described data has little value. Any data repository should insist on accepting only materials accompanied by full descriptions of their specific aims and instructions on how they might be used in other areas of research.

As data portals and cloud-based data-as-a-service and software-as-a-service providers proliferate, these services raise issues of data governance and reproducibility.

Much of today’s discussion of big data is hype about its long-term potential for discovering new and better medicines. It may be more instructive to look at how investigators are putting big data to work right now to build biologically relevant models and to reevaluate the usefulness of promising drugs already under development.

Building a Virtual Animal Model

One of the most critical steps in evaluating any potential medical treatment is testing its toxicity in animal models. Too often, drugs that look promising in animal models prove otherwise when tested in humans, resulting in costly setbacks in drug development. Substitutes for animal testing, such as human cell culture and in vitro assays, have largely not been validated as alternatives. Yet big data technology offers the possibility of providing validated alternatives to some outmoded animal tests.

Investigators at the National Institute of Environmental Health Sciences (NIEHS) are using data analytics and machine-learning techniques that combine datasets from disparate chemical and biological assays to show that a virtual model can in some cases predict toxicity better than traditional animal testing. An early test case involved combining data from several in vitro tests that measure estrogen receptor bioactivity. This strategy, led by scientists at the NIEHS NTP Interagency Center for the Evaluation of Alternative Toxicological Methods (NICEATM) and the U.S. Environmental Protection Agency Office of Research and Development, has been accepted as an alternative to three existing Tier 1 tests in the EPA’s Endocrine Disruptor Screening Program. The NICEATM computational toxicology program, led by deputy director Nicole Kleinstreuer, PhD, is now focusing on combining in chemico and in silico datasets to evaluate non-animal alternatives for skin sensitization testing, a critical step in evaluating ingredients used in dermatology and cosmetics.
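
The general pattern, combining several normalized assay readouts into a single consensus bioactivity call, can be sketched in a few lines. The example below is purely illustrative: the assay names, weights, and cutoff are invented and do not represent the NICEATM/EPA consensus model.

```python
# Illustrative sketch only -- not the NICEATM/EPA consensus model.
# Combines hypothetical in vitro estrogen-receptor assay readouts
# (scaled 0-1) into a single weighted consensus bioactivity score.
import pandas as pd

# Hypothetical normalized readouts from several ER assays per chemical
assays = pd.DataFrame(
    {
        "er_binding":         [0.92, 0.10, 0.45],
        "reporter_gene":      [0.88, 0.05, 0.60],
        "cell_proliferation": [0.75, 0.12, 0.30],
    },
    index=["chemical_A", "chemical_B", "chemical_C"],
)

# Hypothetical weights reflecting confidence in each assay
weights = {"er_binding": 0.4, "reporter_gene": 0.4, "cell_proliferation": 0.2}

consensus = sum(assays[a] * w for a, w in weights.items())
assays["consensus_score"] = consensus
assays["predicted_active"] = consensus > 0.5   # illustrative cutoff

print(assays)
```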

Related Article: Biologist Developing 3-D Computer Model of a Full Animal, Cell by Cell

The NICEATM group uses a data pipeline tool called KNIME, an open-source data analytics platform, along with the open-source programming languages R and Python. Kleinstreuer’s group uses machine-learning techniques to build biological models with defined endpoints to help ensure biological relevance and, importantly, runs both test sets and validation sets to ensure reproducibility. The group also builds probabilistic models using a Bayesian network approach, which can assign likelihoods to predictions.
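
A minimal sketch of that workflow in Python is shown below. It is not the NICEATM pipeline; it simply illustrates the pattern described above: synthetic in chemico/in silico features, a held-out validation split, and probabilistic predictions (here from a random forest standing in for a Bayesian network).

```python
# Illustrative sketch only -- not the NICEATM pipeline. Shows the general
# pattern: fit a model on a training set, hold out a validation set, and
# report probabilistic (likelihood-style) predictions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Hypothetical data: rows = chemicals, columns = in chemico / in silico features
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)) > 0  # sensitizer yes/no

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Probabilities rather than hard calls, loosely analogous to the
# likelihoods a Bayesian network would assign
probs = model.predict_proba(X_test)[:, 1]
print("hold-out ROC AUC:", round(roc_auc_score(y_test, probs), 3))
```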

The NICEATM group has specialists in both bioinformatics and cheminformatics, says Kleinstreuer. Knowledge of chemistry is especially important for putting the data in context when dealing with biochemical assays, she says. For instance, one cell-free assay used in the lab gave beautiful dose-response curves, but further investigation showed that a surfactant added to the assay was denaturing the protein of interest, so the results were an artifact.

The group recently published research in Nature Biotechnology demonstrating that computational approaches could provide accurate predictions of skin sensitization using human primary cells. They found that batteries of in vitro tests, combined with machine-learning computational models, could outperform animal-based approaches to skin toxicity testing.

New life for an old drug

For big data to make a big impact in drug discovery, pharmaceutical companies have to break out of the “one target, one drug” approach that remains the dominant paradigm, says Lei Xie, PhD, a structural systems pharmacologist at Hunter College, City University of New York. Xie develops machine-learning algorithms that build biologically relevant network models combining data from in vitro and in vivo tests and, importantly, can both predict future outcomes and incorporate new information in a continuous learning loop.

“We can use big data as a constraint to limit our search space, and build a model that recapitulates the underlying mechanisms to understand how drugs work and develop better drugs,” he says.
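
A toy example of this kind of network model, with invented drugs, proteins, and pathways rather than anything from Xie’s published work, might look like the following: edges pooled from different data types form the network, and the network then constrains which candidate mechanisms are worth following up.

```python
# Toy sketch (not Xie's actual algorithm): represent drug-target-pathway
# relationships as a network, then use the network to constrain which
# candidate mechanisms are worth exploring.
import networkx as nx

G = nx.Graph()

# Hypothetical edges pooled from different data types
# (structure-based predictions, binding assays, pathway databases)
edges = [
    ("drug_X", "protein_A", {"source": "docking", "weight": 0.8}),
    ("drug_X", "protein_B", {"source": "binding_assay", "weight": 0.9}),
    ("protein_A", "pathway_1", {"source": "pathway_db", "weight": 1.0}),
    ("protein_B", "pathway_2", {"source": "pathway_db", "weight": 1.0}),
    ("pathway_2", "phenotype_growth_arrest", {"source": "in_vivo", "weight": 0.7}),
]
G.add_edges_from(edges)

# Constrain the search space: keep only targets that connect the drug
# to the phenotype of interest
candidates = [
    n for n in G.neighbors("drug_X")
    if nx.has_path(G, n, "phenotype_growth_arrest")
]
print("candidate mechanistic targets:", candidates)
```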

Related Article: Teaching an Old Drug to Maintain its Tricks

In a case study published in 2016 in the journal Scientific Reports, Xie’s group developed a network analysis that combined structural, functional, and genomic interaction data to identify biologically relevant targets for metformin, a decades-old diabetes drug now being investigated for its anti-cancer properties. The algorithms used publicly available datasets to produce a network model of metformin’s biological activity. Crucially, the research team validated the computational approach by verifying metformin’s interaction with most of the predicted targets.

Xie points out that their approach could be used with any compound that has at least one protein receptor with an available crystal structure and for which a genome-wide gene expression analysis has been performed.
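
In rough outline, and with entirely hypothetical scores standing in for the real structural and genomic data, that kind of evidence combination might look like this:

```python
# Illustrative sketch only -- not the published metformin pipeline.
# Ranks hypothetical candidate targets by combining a structure-based
# binding score with the correlation between the drug's expression
# signature and each target's knockdown signature.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_genes = 50

drug_signature = rng.normal(size=n_genes)  # hypothetical genome-wide expression response

candidates = pd.DataFrame(
    {"binding_score": [0.82, 0.64, 0.31]},  # hypothetical structure-based scores
    index=["target_1", "target_2", "target_3"],
)

# Hypothetical knockdown signatures for each candidate target
knockdown = {t: rng.normal(size=n_genes) for t in candidates.index}
candidates["expression_corr"] = [
    np.corrcoef(drug_signature, knockdown[t])[0, 1] for t in candidates.index
]

# Simple combined rank: average of the two evidence types (equal weights assumed)
candidates["combined"] = (
    candidates["binding_score"] + candidates["expression_corr"].abs()
) / 2
print(candidates.sort_values("combined", ascending=False))
```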

Getting more out of your current data

For labs interested in making more productive use of the data they already have at hand, new tools are available now that have proven useful in integrating datasets for hypothesis generation. Based on the premise that a picture is worth a thousand words, or in this case a million data points, the data visualization firm Tableau has seen explosive growth among life sciences and pharmaceutical clients. The company, co-founded by Pat Hanrahan, a founding employee of the animation studio Pixar, and one of his graduate students at Stanford, offers myriad visualization options. Users with no formal training in data science can import and combine data from multiple sources and display it in a visually intuitive dashboard.
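
For labs that prefer open-source tooling, the same basic idea, merging data from multiple sources and rendering a quick visual summary, can be sketched with pandas and matplotlib. This is an illustration with made-up data, not Tableau’s own workflow.

```python
# Not Tableau -- a minimal open-source illustration of the same idea:
# merge data from two hypothetical sources and display a quick visual summary.
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical source 1: assay results
assay = pd.DataFrame(
    {"compound": ["A", "B", "C", "D"], "ic50_uM": [0.4, 2.1, 9.8, 0.9]}
)
# Hypothetical source 2: project metadata
meta = pd.DataFrame(
    {"compound": ["A", "B", "C", "D"], "series": ["lead", "lead", "backup", "lead"]}
)

merged = assay.merge(meta, on="compound")

# One dashboard-style panel: potency by compound, colored by series
colors = merged["series"].map({"lead": "tab:blue", "backup": "tab:orange"})
plt.bar(merged["compound"], merged["ic50_uM"], color=colors)
plt.ylabel("IC50 (uM)")
plt.title("Potency by compound (hypothetical data)")
plt.show()
```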

Getting more out of the scientific literature

Even though it is publicly available, the published research indexed in MEDLINE, the ubiquitous literature resource used by virtually everyone in biomedicine, isn’t being used to its fullest, because the tools available to plumb its depths rely on simplistic Boolean operators that return flat lists of publications. Using three converging technologies—data analytics, advanced visualization, and the Semantic Web, a data-sharing framework—scientists have developed a search algorithm that presents results as an interactive graph of words that visually connects key relationships. The software, called Oak Ridge Graph Analytics for Medical Innovation (ORiGAMI), is the result of a collaboration between the National Library of Medicine and Oak Ridge National Laboratory data scientist Sreenivas Rangan Sukumar, PhD.
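
ORiGAMI itself is far more sophisticated, but the underlying idea of treating the literature as a graph of connected concepts rather than a flat result list can be illustrated with a small co-occurrence sketch (the abstracts and term list below are hypothetical):

```python
# Simplified illustration only -- not ORiGAMI. Builds a small term
# co-occurrence graph from abstract text, the basic idea behind presenting
# literature search results as a graph of connected concepts.
from itertools import combinations
import networkx as nx

# Hypothetical abstracts (stand-ins for MEDLINE records)
abstracts = [
    "metformin ampk activation reduces tumor growth",
    "ampk signaling links metabolism and cancer",
    "air pollution exposure associated with lung cancer risk",
]
terms = {"metformin", "ampk", "cancer", "tumor", "pollution", "lung"}

G = nx.Graph()
for text in abstracts:
    present = [t for t in terms if t in text.split()]
    for a, b in combinations(sorted(present), 2):
        weight = G.get_edge_data(a, b, {}).get("weight", 0)
        G.add_edge(a, b, weight=weight + 1)

# An indirect connection a plain Boolean search would not surface as a path
print(nx.shortest_path(G, "metformin", "cancer"))
```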

Related Article: Advancing Big Data Science

“Humans’ limited bandwidth constrains the ability to reason with the vast amounts of available medical information,” Sukumar stated in the announcement of the new search feature. “By design, ORiGAMI can reason with the knowledge of every published medical paper every time a clinical researcher uses the tool. This helps researchers find unexplored connections in the medical literature. By allowing computers to do what they do best, doctors can do better at answering health-related questions.”

The tool has already assisted investigator Georgia Tourassi, PhD, director of ORNL's Health Data Sciences Institute, in her research investigating the link between environmental exposures and lung cancer.

“All of us have those moments of epiphany when certain thoughts click into our head and we move on to explore hypotheses deeper,” Tourassi says. “This tool enables that serendipity.”