Lab Manager | Run Your Lab Like a Business

INSIGHTS on Big Data in Drug Discovery

Big data might bring more benefits to drug discovery than to any other field. For one thing, discovering a new drug turns out to be incredibly difficult. On average, a pharmaceutical company tries about 10,000 drug candidates for every one that ends up on the market. Plus, the process of discovering and developing a new drug costs hundreds of millions of dollars and takes more than a decade, and some estimates put both figures higher.

Mike May, PhD

Mike May is a freelance writer and editor living in Texas.


Scientists at Berg start with human tissue samples, collect many kinds of data on them, and then use advanced algorithms and supercomputers to find crucial differences between healthy and diseased samples. Image courtesy of Berg.


A Combination of Computations and Simulations Will Change Tomorrow's Health Care

To make this entire process more efficient and economical, pharmaceutical scientists want to find the most promising drug candidates—ones that are the most likely to be safe, effective, and affordable. Some experts believe that large datasets, and knowing how to make the most of them, can create a more targeted approach to discovering tomorrow’s drugs.

The concept of big data covers a wide range. For this article, let’s just think of big data as a complex dataset that could be tricky to handle, keeping in mind that what counts as “big” can vary considerably from one application to the next. Also, many modern tools, such as next-generation sequencing (NGS) platforms that speed up and reduce the cost of gathering information about someone’s genome, pump out data at a rate that was unimaginable even a few years ago. That makes more data than ever available, and the volume grows every second. Consequently, it’s easier to get the data than it is to make the best use of it. That could be the most complex part of applying big data to drug discovery.

When asked about the key benefits of applying big data to drug discovery, Niven Narain—cofounder, president, and chief technology officer at Berg, a biopharma company in Framingham, Massachusetts—says, “Patient and disease biology are both pretty complex.” He adds, “Trying to distill the disease into a neat hypothesis-driven scientific explanation often falls short of what the true story is.”

With big-data tools, pharmaceutical scientists hope to learn more about human biology and disease, plus which drugs could do the most good.

Adding Intelligence

“It’s easy now to create big data,” says Narain. Turning those data into an actionable endpoint for a physician, researcher, or patient is the challenge. “That takes an analytical platform,” he says.

Narain and his colleagues developed such a platform based on artificial intelligence. This system starts with human tissue samples and analyzes them with genomics, clinical data, and so on. Then, this platform explores the data for patterns in healthy and diseased samples, in search of the key differences. In addition, this technology can look at samples of a patient’s disease over time. Doing all this takes the power of a supercomputer, because this platform often deals with as many as 14 trillion data points for one tissue sample. To do that, Berg runs some of its own high-performance computing clusters, or sometimes purchases computing time from Amazon. “Amazon has become so inexpensive to use that we mix and match ours with theirs,” says Narain.
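To make the idea of searching for key differences between healthy and diseased samples concrete, here is a minimal Python sketch. It is not Berg's platform, whose methods this article does not describe. The function name `rank_differences`, the feature names, and the toy measurements are all hypothetical, and a simple standardized mean difference stands in for the far more sophisticated analysis described above.

```python
from statistics import mean, stdev

def rank_differences(healthy, diseased):
    """Rank features by standardized mean difference between two groups.

    healthy, diseased: dict mapping feature name -> list of measurements.
    Returns a list of (feature, score) sorted by descending absolute score.
    """
    scores = {}
    for feature in healthy.keys() & diseased.keys():
        h, d = healthy[feature], diseased[feature]
        # Average spread of the two groups; fall back to 1.0 if it is zero.
        spread = (stdev(h) + stdev(d)) / 2 or 1.0
        scores[feature] = (mean(d) - mean(h)) / spread
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Hypothetical toy data: three made-up features, four samples per group.
healthy = {
    "protein_A": [1.0, 1.1, 0.9, 1.0],
    "lipid_B": [5.0, 5.2, 4.9, 5.1],
    "metabolite_C": [2.0, 2.1, 1.9, 2.0],
}
diseased = {
    "protein_A": [3.0, 3.2, 2.9, 3.1],
    "lipid_B": [5.1, 5.0, 5.2, 4.9],
    "metabolite_C": [2.4, 2.5, 2.3, 2.4],
}

ranking = rank_differences(healthy, diseased)
for feature, score in ranking:
    print(f"{feature}: {score:+.2f}")
```

In this toy example, the feature whose values shift most between the two groups lands at the top of the ranking, while a feature with the same values in both groups lands at the bottom. A real analysis would involve vastly more features and would need to correct for testing many of them at once.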

In addition, Narain says, “We have the capability to put this data together to explore potential drug targets or look for diagnostic biomarkers.” The targets can then be used in drug discovery to find a compound that can attack key aspects of a disease. The biomarkers can be used in many ways, from tracking the efficacy of a drug to stratifying a patient population in a clinical trial.

Ayasdi’s Topological Data Analysis (TDA) reveals hidden relationships in complicated datasets, which can be used to develop new treatments. Image courtesy of Ayasdi.

Narain and his colleagues call this proprietary approach interrogative biology. In short, they are interrogating healthy and diseased biology, as well as the resulting data, and combining that information to, among other things, try to find the best drug for a very specific disease. For instance, Berg’s scientists developed a model of Parkinson’s disease from samples of people from 40 to 90 years old and beyond. The database is stratified by age, gender, and response to L-dopa, a drug that can ease the symptoms of the disease in some cases. “From this,” says Narain, “we can show the various stages of the disease. We can look at it from a molecular basis.” That could lead to the discovery of new drugs that change how physicians treat Parkinson’s disease.

Other organizations can also benefit from this tool. To get access to Berg’s technology, a pharmaceutical company or major university can, for example, license one of Berg’s drug targets or drug candidates to work in collaboration.

Sampling Shapes

Other companies are also developing sophisticated tools to discover new drugs. In Menlo Park, California, for example, scientists at Ayasdi use topology—the study of shape—to analyze data, which provides a way for scientists to find subtle, often hidden relationships in complex datasets.

As an example, Ayasdi’s scientists applied TDA (topological data analysis) to a gene-expression breast cancer dataset collected more than ten years ago at the Netherlands Cancer Institute (NKI). “We found insights within minutes using TDA and advanced machine learning,” says Devi Ramanan, Ayasdi’s head of collaborations. “We identified a previously unknown subgroup of oncology survivors who exhibited particular characteristics—genetic indicators of poor survivors—which will allow us to better understand this group and potentially help improve survival rates for this disease, which might potentially help us find a cure.”

To expand the use of TDA, Ayasdi develops collaborations with other organizations. In these collaborations, Ayasdi provides software, training, and support. “You don’t need to be a data scientist or computer scientist to use our software,” says Ramanan. “You only need to understand your data.” At the time that Ramanan talked with Lab Manager, Ayasdi had collaborations with more than 40 organizations.

According to Ramanan, researchers can use TDA to find crucial insights that previously eluded them due to the complexity of their data. Ayasdi’s collaborators, for example, have frequently been able to use more traditional techniques to verify and extend the initial findings they made with TDA.

Text To Treatment

As the previous examples show, the use of big data in drug discovery often requires a combination of fields. Scott Spangler—now a principal data scientist at IBM Watson Innovations in Almaden, California—got started, unknowingly, on such a combination when he was doing text mining in the pharmaceutical industry. This included chemical analysis, like mining text for different names used for the same formula.
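The kind of name reconciliation described here can be illustrated with a small Python sketch. It is an illustration only, not IBM's method: the synonym table, the function name `group_mentions_by_formula`, and the sample sentences are hypothetical, and real text mining relies on much larger curated dictionaries and more robust matching.

```python
import re
from collections import defaultdict

# Hypothetical, hand-built synonym table: name -> molecular formula.
SYNONYMS = {
    "aspirin": "C9H8O4",
    "acetylsalicylic acid": "C9H8O4",
    "paracetamol": "C8H9NO2",
    "acetaminophen": "C8H9NO2",
}

def group_mentions_by_formula(texts):
    """Map each formula to the sorted list of names found in the texts."""
    found = defaultdict(set)
    for text in texts:
        lowered = text.lower()
        for name, formula in SYNONYMS.items():
            # Whole-word match so a name inside a longer word is ignored.
            if re.search(rf"\b{re.escape(name)}\b", lowered):
                found[formula].add(name)
    return {formula: sorted(names) for formula, names in found.items()}

# Made-up sentences standing in for mined documents.
docs = [
    "Patients received aspirin daily.",
    "Acetylsalicylic acid inhibits platelet aggregation.",
    "Paracetamol was used as a comparator.",
]
print(group_mentions_by_formula(docs))
```

Grouping by formula lets two documents that use different names be recognized as discussing the same compound, which is the point of the synonym mining described above.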

Now he applies similar thinking to IBM Watson, the computer system that defeated two former Jeopardy! champions in 2011. IBM has since applied Watson to health care, among other fields. For example, in collaboration with Baylor College of Medicine in Houston, Texas, IBM aimed Watson at molecular biology. Spangler says, “We spent a few years training Watson to understand biology, to think in terms of the physical objects, like the chemicals, and targets, like proteins in a cell.” He adds, “We also taught it to understand the disease or condition that you are trying to alleviate with the drug.” In addition, this technology can help pharmaceutical scientists pick the best drug targets.

Scott Spangler, principal data scientist, IBM Watson Innovations, demonstrates how IBM Watson cognitive technology can visually display connections in scientific literature and drug information. Here, Watson displays protein pathways that can help researchers accelerate scientific breakthroughs by spotting linkages that were previously undetected. Image courtesy of Jon Simon/Feature Photo Service for IBM.

To take on these challenges with computation, scientists need large datasets. As Spangler says, “Biology is very hard and very statistical.” He adds, “You can’t think of it like computer science, with a direct cause and effect. It doesn’t work that way.” A molecular process that works one way now could work differently an hour from now or in a different person. This creates very noisy data.

That’s exactly what Watson tackles. It tries to quantify confidence. As Spangler asks, “How certain are we of various conclusions? Where can we find corroborating evidence?”

In particular, scientists can use big data for drug discovery with IBM’s Watson Discovery Advisor. This is available through a browser as software as a service. So anyone can use it to test hypotheses virtually, all based on the data in millions of published papers. As Spangler says, “Watson digests way more information than a human expert can, and it can help them be better scientists and make better predictions moving forward.”

Calculating the Kinome

A kinase is a protein enzyme that drives changes in the structure of another protein by adding a phosphate group. Typically, this changes the function of the protein. Consequently, many drugs target kinases, and these proteins could provide even more drug targets ahead. An organism’s collection of kinases makes up the so-called kinome.

Daniel Ian McSkimming, a doctoral candidate in the Institute of Bioinformatics at the University of Georgia at Athens, and his colleagues are mining the human cancer kinome—all of the cancer-related kinases. When asked how much data this involves, McSkimming says, “It’s an ever-increasing amount.” Humans have about 500 kinases, and they can be regulated in problem-causing ways in cancers. In addition, the cancer kinome includes thousands of different forms of kinases, caused by something tweaked in a kinase’s structure, often altering the kinase’s function. That complicates any effort to keep track of the kinases, their variants, how they relate to cancer, and so on.

To take on that problem, McSkimming and his colleagues created ProKinO, which is freely available. With this knowledgebase, a user can explore functional information about kinases, as well as mutational information, information related to specific cancer cell lines, how kinases impact resistance to specific cancer drugs, and more.

“This tool gives you the ability to see how kinase researchers think about kinases,” McSkimming says. “It will help you understand the regions of the proteins that have been experimentally identified as important, and this can directly inform drug research on how to inhibit a kinase.”

ProKinO provides connected information about human cancer-related kinases, which could help drug researchers discover potent inhibitors. Image courtesy of Daniel Ian McSkimming.

ProKinO will keep growing. McSkimming says, “It’s crucial that we add to it, and we’ll find both new types of analysis and new types of data.”

Focus On The Function

In all cases, big data is of use only when you know what to do with it. For example, Jordan Stockton, director of marketing for enterprise informatics at Illumina in San Diego, California, says, “Winnowing the many pieces down to the useful ones gets underemphasized.”

That step requires software that many people can use. For example, Illumina’s NextBio lets a scientist use genomic data at nearly any stage of drug discovery or development. For instance, this platform provides genomic data from cell lines that can be used to study specific drug targets.

Tools like NextBio and others mentioned here will become increasingly valuable in drug discovery. As Naomi O’Grady, marketing manager for oncology at Illumina, explains, “The discovery-to-drug process is like a funnel.” That is, researchers tend to work on more focused datasets as the process moves forward. O’Grady points out, “But even at the bottom end of the funnel, the amount of information being considered is getting bigger. More information is being considered at every step.”

Consequently, pharmaceutical scientists will need ever more data to make the most of the existing knowledge base. They will also need sophisticated tools that pick out the key data and the interconnections. Only then can large datasets turn drug discovery into an efficient and effective process.