Making Data Meaningful

A search of Google Trends for “big data” in news headlines reveals almost no interest until 2011, and
then the numbers soar.

Written by Mike May, PhD

How to increase the value of integrated information analytics

Moreover, a search for “big data” on Google in general returns more than 300 million hits. Consequently, it’s no surprise that Igor Jurisica, who holds the Tier I Canada Research Chair in Integrative Cancer Informatics and is a professor of biomedical physics and computer science at the University of Toronto in Canada, says, “Recently, some of the most common buzz in informatics focuses on big data.” Despite today’s enormous amount of data, says Jurisica, there is a shortage of knowledge.

There’s also a shortage of infrastructure. “A lot of people jumped on the big-data bandwagon prematurely,” says John F. Conway, global director, R&D strategy and solutions at LabAnswer in Sugar Land, Texas. “Their underlying informatics environments weren’t ready for it.” That continues to be the case. “Lots of companies are having difficulty assessing the data they need to make informed decisions [or garner knowledge from it],” Conway says. Consequently, these companies get less from informatics than they could.

Historically, computer systems focused on structured data, such as numbers in a table in a database, but unstructured data, such as text, created a challenge for analysis. “We have a reasonable handle on structured data,” Jurisica says, “but the need is quickly growing to integrate unstructured and structured data.” For instance, biomedicine must combine structured information, such as test results or even the sequence of a patient’s genome, with unstructured data, such as written medical records. To make use of this data in big ways, researchers must analyze millions of records. “That takes enormous computing power to sift through the data and then turn that efficiently into meaningful information and present that to the user in an effective way,” Jurisica explains.

Although the data makes informatics complex, other elements make it even more intractable. For example, informatic systems must deal with various kinds of computer architecture—from smartphones and tablets to supercomputers and the cloud. On top of that, informatics users range from biologists and engineers to patients and physicians. “So a system needs to be flexible in terms of what to present to whom and in what form to increase the value of integrated information analytics,” says Jurisica.

A technology transition

Although chemistry informatics companies keep providing scientists with more and more options for data analysis, how those results are being delivered and reviewed is evolving rapidly. As Ryan Sasaki, director of global strategy at Advanced Chemistry Development (ACD/Labs) in Toronto, Canada, explains, “The biggest trend is the continuing evolution of thin-client technologies.” With a thin client, a low-powered device depends on a high-powered one. For example, a laptop can be connected to a supercomputer to deliver sophisticated analysis without the user needing to own all the hardware.

A thin-client approach also reveals some of the history of informatics. “It’s very interesting to follow lab informatics over the last 20 years or so,” says Sasaki. “In the early days, companies had their own IT developers who made their own systems.” The overhead required to develop and maintain those systems, though, triggered a transition. “A 180-degree change caused customers to look to providers for an end-to-end solution, like an ELN or LIMS [electronic lab notebook or laboratory information management system], where one size fits all,” Sasaki says. That approach still created challenges, because different customers like to do things in their own way.

“So now we are seeing more of a hybrid approach,” Sasaki says, “where companies are looking to build internal web technology and have vendors provide applications that plug into that.” As an example, he mentions the ACD/Spectrus Platform, which can be used to collect and analyze a wide range of chemical and analytical data. Although Sasaki calls this a robust platform that can stand on its own, market requirements demand that it be a platform that can easily plug and play with other systems in an organization. Furthermore, the transitions in informatics demand that nonexperts can use it.

Pushing productivity

Informatics helps scientists keep track of and analyze enormous amounts of data through a variety of portals, from computers to smartphones. (Image courtesy of BIOVIA.)

Although modern informatics allows researchers to accomplish more with less in-house hardware or software, users expect more from the results. “If you look at the history of the space,” says Gene Tetreault, senior director of enterprise laboratory management at Boston area-based BIOVIA, “there wasn’t a lot of push on productivity for informatics in the pharmaceutical industry, but the tightening of financial belts is driving the need for informatic solutions to become more productive.”

In the pharmaceutical industry in particular, that increase in productivity must also meet expanding regulatory requirements. So informatic systems must constantly adapt to new compliance issues. “The regulations are increasing by orders of magnitude, so you need electronic systems to keep up,” says Tetreault.

While achieving these goals, informatic users also want more mobility. Tetreault says, “The desire is increasing dramatically to be able to access data in or outside the lab on a phone or tablet.”

Lab informatics will also incorporate augmented displays—things such as Google Glass and beyond—to enhance productivity. “Imagine standing in front of a lab refrigerator and seeing the samples inside without opening the door,” Tetreault says. “This kind of technology could also record what you are doing at the lab bench.”

Going global

A large company’s productivity also depends on fluid interactions between different business systems, units, and locations. As a result, Mark Harnois, director of product management for the informatics business unit at Waters in Milford, Massachusetts, says, “In the last few years, we’ve seen many customers evaluating all the technology they have in their labs and developing strategies to harmonize, globalize, and standardize the technology.” To harmonize and standardize, companies want their different systems and sites to share a smaller software footprint, which reduces the training, support, and software validation required. As Harnois says, “You want to make sure there’s seamless integration between solutions you have.” Globalizing means being able to do that around the world.

Despite this emphasis on wide-scale consistency, Harnois points out the need for flexibility in an informatics system. He says, “Your technology must be able to adapt to your environment.”

As an example, Harnois says, “We offer solutions that allow users to connect different software packages, like a chromatography system and a LIMS system, to multivendor business systems applications.” He adds, “Our products are also designed to be deployed in enterprise-wide area networks that can be global.”

Tackling problems in teams

In the life sciences, proteins create one especially tricky challenge: figuring out function from form. How a protein works depends on its components, which are amino acids, and the three-dimensional shape that they take, which arises from folding. This knowledge could reveal, for instance, how a molecule drives a disease as well as how a treatment might manage it.

An informatics approach to molecular biology generated this diagram of a physical protein interaction network. (Image courtesy of Igor Jurisica.)

Jurisica and a team of colleagues turned this molecular biology problem into an informatics one. Working with researchers at the Hauptman Woodward Medical Research Center in Buffalo, New York, they used robots to screen 13,000 proteins using 1,536 different crystallization conditions and then took six images of each over time. That created about 120 million images. Then, the scientists used algorithms for morphological image analysis and machine learning to characterize and classify the data on IBM’s World Community Grid, which is created from a network of almost 3 million high-end workstations around the world. “We created a much better image classification algorithm using this grid that now rivals human expert performance,” Jurisica says. Even on this high-performance grid, it took 5.5 years to analyze all the data, but it would have taken 182 years on the hardware that the team had. As Jurisica says, “This was a game-changing collaboration, which enabled us to completely change our approach to solving this protein crystallography problem.”
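The scale of that screening effort is easy to check with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and the speedup is simply the ratio of the two quoted run times, not a number reported by the researchers.

```python
# Figures quoted in the article (variable names are illustrative).
proteins = 13_000
crystallization_conditions = 1_536
images_per_experiment = 6

# Every protein is screened under every condition and imaged six times.
total_images = proteins * crystallization_conditions * images_per_experiment

# Quoted run times: on the grid versus on the team's own hardware.
grid_years = 5.5
local_years = 182
speedup = local_years / grid_years

print(f"Total images: {total_images:,}")          # roughly 120 million
print(f"Approximate speedup: {speedup:.1f}x")
```

The product comes to 119,808,000 images, which matches the "about 120 million" cited, and the two run times imply that the grid finished the job roughly 33 times faster than the team's own hardware would have.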

Other groups also team up to expand the applications of advanced informatics. For instance, Matthew Hahn, professor in the department of biology and the School of Informatics and Computing at Indiana University in Bloomington, says, “With a virtual machine, you can use a computer on your desktop, and it can do your analysis for you.” He adds, “You don’t need to do things with command lines, because you get a desktop of tools.”

For example, the US National Science Foundation (NSF) developed iPlant to provide scientists with tools to find ways to feed the world’s expanding population. This project supplies researchers with access to databases and software, all delivered by the NSF plus the University of Arizona, Texas Advanced Computing Center, Cold Spring Harbor Laboratory, and the University of North Carolina at Wilmington. According to the iPlant website, “By enabling biologists to do data-driven science by providing them with powerful computational infrastructure for handling huge datasets and complex analyses, iPlant fills a niche created by the computing epoch and a rapidly evolving world.”

Powerful informatic approaches, such as the Waters NuGenesis Lab Management System, let scientists connect a variety of research platforms with business ones. (Image courtesy of Waters.)

Hahn and his colleagues at Indiana University provide a similar NSF-funded service called the National Center for Genome Analysis Support. “It lets anyone with NSF funding get help with sequencing analysis,” Hahn says. “You don’t need to install your own software, and we will even tell you which button to press.”

Looking ahead

Before launching into a big-data driven system of analytics, scientists must get the informatics under control. For example, Conway points out that many companies still need to implement the ability to search an entire enterprise system, and that includes metadata, such as the details behind an experiment. “If the metadata are not there, a search is worthless,” Conway explains. “People need to think about ways to ask more from their data and put things in place now.”
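Conway's point about metadata can be illustrated with a small sketch. Everything here is hypothetical: the `ExperimentRecord` class, the field names, and the sample records are invented for illustration and do not describe any vendor's system. The idea is only that a search can match on experimental details when they were captured, and that records without them never show up.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    """A hypothetical record: one result plus the details behind it."""
    record_id: str
    result: float
    metadata: dict = field(default_factory=dict)


def search(records, **criteria):
    """Return records whose metadata match every key/value pair given."""
    return [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in criteria.items())
    ]


def missing_metadata(records, required_keys):
    """Return IDs of records that lack at least one required metadata key."""
    return [
        r.record_id for r in records
        if any(key not in r.metadata for key in required_keys)
    ]


# Invented sample data: only the first record is fully annotated.
records = [
    ExperimentRecord("EXP-001", 0.82, {"instrument": "HPLC-2", "analyst": "A. Lee"}),
    ExperimentRecord("EXP-002", 0.79, {"instrument": "HPLC-2"}),
    ExperimentRecord("EXP-003", 0.91),
]

hits = search(records, instrument="HPLC-2")
print([r.record_id for r in hits])
print(missing_metadata(records, ["instrument", "analyst"]))
```

The third record holds a perfectly good result, but because nothing was recorded about how it was produced, a search by instrument cannot find it. An audit like `missing_metadata` is one simple way to spot such gaps before relying on enterprise-wide search.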

The Italy-based European Institute of Oncology uses the Thermo Scientific LIMS for Biobanks to process thousands of biospecimens. (Image courtesy of Thermo Fisher Scientific.)

Putting the data in the right place determines its value. For example, the European Institute of Oncology (IEO) turned to a solution from Thermo Fisher Scientific in Waltham, Massachusetts. IEO is a Milan, Italy-based organization committed to making an active contribution to fighting cancer, particularly tumors of the breast, lung, prostate, and bowel. It implemented the Thermo Scientific LIMS for Biobanks to more efficiently manage its biospecimen data. For instance, IEO is using this system to process more than 4,000 biospecimens annually, including liquids, solids, and nucleic acids. Previously, IEO’s data management solution did not allow integration across multiple platforms, so information remained in silos or needed to be combined manually. Now sample data are integrated across multiple systems.

Education also comes into play when planning for tomorrow. What does a science student need to learn in order to develop an informatics arsenal? Hahn says, “Basic statistics and probability are really important.” He adds, “You need some basic UNIX command-line skills, and Python and Perl are good for moving around text files, like DNA strings.”
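As a rough idea of the kind of text handling Hahn has in mind, here is a minimal Python sketch that treats DNA as plain strings. The function names and the sample sequences are illustrative assumptions, not taken from any course or toolkit he mentions, and the FASTA reader handles only the simplest form of that format.

```python
# Map each base to its complement, preserving upper/lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq: str) -> float:
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-format lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Invented example input; a real file would be passed as open("reads.fasta").
example = [">seq1", "ATG", "C", ">seq2", "GG"]
for name, seq in read_fasta(example):
    print(name, seq, reverse_complement(seq), gc_content(seq))
```

Because `read_fasta` accepts any iterable of lines, the same function works on a list in a test and on an open file handle in practice.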

To make the most of informatics, science needs advanced hardware and software tools, plus personnel to run them. Only then can big data turn into big improvements.
