
Big Data Mining

This new ability to ‘do something with the data’ is not driven by size but by analytical capabilities

by Bernard B. Tulsi

New Analytical and Computing Tools Allow Researchers to Make Better Sense of Data

Not so long ago, the power and promise of big data appeared to reside in three dimensions—the volume, velocity, and variety of available sources. While size, speed, and range remain key attributes, “The big data revolution is that now we can do something with the data,” Harvard’s Weatherhead Professor Gary King told writer Jonathan Shaw for a 2014 Harvard Magazine article, “Why Big Data Is a Big Deal.”

This new ability to “do something with the data” is not driven by size but by analytical capabilities—better statistical and computational methodologies and powerful algorithms, according to King. The increasing use of machine learning tools for handling, mining, and making sense of data is driving this process even faster. In its literature, SAS, a global supplier of advanced analytics tools, characterizes machine learning as “a method of data analysis that automates analytical model building.” This means that machine learning uses “algorithms that iteratively learn from data” enabling computer systems to “find hidden insights without being explicitly programmed where to look.”
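To make the idea of "iteratively learning from data" concrete, here is a minimal Python sketch that fits a straight line to noisy measurements by repeatedly nudging its parameters based on the data alone. The data, learning rate, and gradient-descent recipe are illustrative assumptions, not SAS's implementation.

```python
# A minimal sketch of "iteratively learning from data": gradient descent
# fits a line to noisy measurements without being told the slope or
# intercept in advance. (Illustrative only; not any vendor's implementation.)
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)                  # e.g., analyte concentration (invented)
y = 3.2 * x + 1.5 + rng.normal(0, 1, 200)    # noisy instrument response (invented)

slope, intercept = 0.0, 0.0                  # start with no prior knowledge
lr = 0.02                                    # learning rate
for _ in range(3000):                        # each pass refines the model a little
    err = slope * x + intercept - y
    slope -= lr * (err * x).mean()           # gradient step on the slope
    intercept -= lr * err.mean()             # gradient step on the intercept

print(f"learned slope={slope:.2f}, intercept={intercept:.2f}")   # recovers roughly 3.2 and 1.5
```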

Big data are escalating in value in the laboratory, as they have in a variety of other business enterprises that rely on effective generation and management of their data assets. Clinical laboratories have long been valuable generators and repositories of healthcare data. Speaking at Stanford University’s 2015 Big Data and Biomedicine Conference, Rob Merkel, healthcare and life sciences leader with IBM Watson Group, pointed to two approaches the group uses for gathering appropriate information—knowledge- and data-driven. The knowledge-driven side is based on 700,000 new scientific articles per year and data from 180,000 clinical studies—all strongly tied to clinical laboratories. On the data-driven side, he noted that on average 400 gigabytes are generated during an individual’s lifetime, plus about 6 gigabytes of genomics data—all solidly linked to clinical labs.


Jarrad Hampton-Marcell, research coordinator at Argonne National Laboratory, says that while he and his colleagues do not regularly use the term, big data are involved in many aspects of their work because they have to mine and manage a tremendous amount of data on a regular basis. His work entails collaborating with worldwide consortiums such as the Earth Microbiome Project, which focuses on DNA extraction associated with microbial ecology to develop standardized methodologies and pipelines. These projects, which incorporate more than 10,000 studies and 100,000 collaborators, facilitate the comparison of microbiome data across numerous geographic regions to enable better understanding of biological processes in different environments. “Our lab alone processes about 20,000 samples a year,” he says.

While this research examines the genomics of individual or communities of organisms, it also examines environmental linkages—broader interactions and relationships by modeling possible responses by organisms and their communities to changes in prevailing conditions. He says that with the aid of powerful sequencers and supercomputers, “What we essentially do is next-generation sequencing that allows us to look at every organism within an environment and look not only at structure and function but also at relationships to different contextual markers.”

He says that today’s powerful sequencers generate data that are transferred to supercomputers that run large algorithms capable of handling tens to hundreds of millions of data points. “This enables us to move not only fast but also accurately.”
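As a rough sketch of the kind of downstream analysis this enables, the hypothetical Python snippet below rank-correlates each organism's abundance with one contextual marker (soil pH) across a set of samples. The sample counts, taxa, and marker are invented; the real pipelines run at vastly larger scale on supercomputers.

```python
# Hypothetical sketch: relate each organism's abundance to a contextual
# marker (here, soil pH) across samples. Data and scale are invented.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
n_samples, n_taxa = 500, 2000
abundance = rng.poisson(5, size=(n_samples, n_taxa))   # read counts per organism per sample
soil_ph = rng.uniform(4.5, 8.5, n_samples)             # one environmental variable

hits = []
for taxon in range(n_taxa):
    rho, p = spearmanr(abundance[:, taxon], soil_ph)   # rank correlation
    if p < 0.05 / n_taxa:                              # crude Bonferroni correction
        hits.append((taxon, rho))

print(f"{len(hits)} taxa associated with pH after correction")
```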

Prior to these advances, researchers were not only aware of the issues but were also quite capable of developing research questions around them. Hampton-Marcell notes, however, that they did not have the technologies to conduct the analyses, which imposed severe limitations. Still, there are inherent challenges. Big data analyses are not typically controlled experiments, and this introduces limitations and challenges around the ability to control for variations and potential confounders and in adjusting for their impact on results.


To be sure, the advent of supercomputing power is providing powerful solutions. “It used to be true that you couldn’t possibly look at all the data, so it was necessary to come up with some form of statistical sampling that was representative of the whole population of data and do calculations against that,” says Trish Meek, chief strategist for informatics and chromatography solutions at Thermo Fisher Scientific.

“With the availability of supercomputing power, you don’t have to take a subset of the data anymore—you truly can mine and use all of the information you have,” she says. Meek says that customers really became interested in this ability to use all their data for research and decision processes during the past five or six years.
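A toy contrast makes the point. The sketch below, using invented assay results, compares an out-of-spec rate estimated from a one percent statistical sample with the same rate computed across every record.

```python
# Toy contrast: estimate an out-of-spec rate from a 1% sample versus
# computing it over all records. Numbers are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(2)
results = rng.normal(100.0, 2.0, 10_000_000)              # ten million assay results (invented)
out_of_spec = (results < 95) | (results > 105)            # spec limits (invented)

sample = rng.choice(out_of_spec, size=100_000, replace=False)   # 1% representative sample
print(f"sampled estimate : {sample.mean():.4%}")
print(f"full-data answer : {out_of_spec.mean():.4%}")
```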

She notes that within the past few years the concerns in labs have shifted from finding and fixing current problems in the lab to trying to detect potential future problems and taking steps to avoid them. “That’s where the focus is now. Not ‘How do I stop and recall a bad product?’ but ‘How do I stop a bad product in the first place?’”

“Our customers are looking at a few things to accomplish that,” she says. First, labs are doing a much better job of connecting all their systems. To ensure informed decisions, there has to be a connected flow of information throughout the enterprise from simple devices like balances to more complex instruments like chromatography systems, she says.

She points out that it is now better recognized that manual pockets of information kept in silos, beyond the reach of higher-level tools like IBM Watson, exclude data and hamper the ability to put all available information in the service of the laboratory. “That is the biggest change we are seeing.”

In addition, Meek says, some of Thermo Fisher’s customers are choosing to use analytics tools outside the laboratory as well. These customers have IT departments focused on analytics that pull in data from the lab, the manufacturing environment, the ERP systems, and the LIMS, so they can look at data across all areas, she says. “The key, however, is that the LIMS provides the ability to pull all the lab data into one location. So that’s one system that they need to hit to get all the information they need for laboratory answers,” she says.
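In spirit, pulling those sources together can be as simple as joining exported tables. The snippet below is a hypothetical illustration (table layouts and column names are invented, not any particular LIMS or ERP schema) that merges lab results with batch records so a question can be asked across both.

```python
# Hypothetical sketch: join lab results exported from a LIMS with batch
# records from an ERP system. Table layouts and column names are invented.
import pandas as pd

lims = pd.DataFrame({
    "batch_id": ["B001", "B002", "B003"],
    "assay": ["purity", "purity", "purity"],
    "result_pct": [99.2, 97.8, 99.5],
})
erp = pd.DataFrame({
    "batch_id": ["B001", "B002", "B003"],
    "production_line": ["L1", "L2", "L1"],
    "raw_material_lot": ["RM-17", "RM-18", "RM-17"],
})

combined = lims.merge(erp, on="batch_id")                           # one view across both systems
print(combined.groupby("raw_material_lot")["result_pct"].mean())    # e.g., purity by raw-material lot
```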

Andy Walker, principal engineer at the National Renewable Energy Laboratory (NREL), works on the Renewable Energy Optimization tool (REopt, initiated in 2007 as REO), which seeks to automate the identification of cost-effective renewable energy projects. “The objective of REopt is to minimize life cycle cost of projects,” he says.

He notes that the typical constraints in renewable energy projects may include stipulations such as the need to include 30 percent renewable electricity, 20 percent carbon reduction (the goal of many corporations), net zero energy, land use limits, and initial cost limits among others. “The constraints are the fun part, because constraints shape the problem and the resulting recommendations,” he says.
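A toy version of that constrained choice, with entirely invented figures, is sketched below: enumerate every combination of candidate projects, discard any mix that misses a 30 percent renewable-electricity floor or exceeds an initial cost cap, and keep the combination with the lowest life cycle cost. REopt itself works from far richer data and a real optimizer.

```python
# Toy constrained selection: minimize life cycle cost subject to a renewable
# floor and an initial cost cap. All figures are invented for illustration.
from itertools import product

# (name, life cycle cost $k, initial cost $k, renewable MWh/yr)
projects = [("PV carport", 900, 400, 650),
            ("Rooftop PV", 600, 300, 420),
            ("Small wind", 800, 500, 700),
            ("Biomass CHP", 1200, 700, 900)]
site_load_mwh = 3000
best = None

for choice in product([0, 1], repeat=len(projects)):          # every build/skip combination
    lcc = sum(c * p[1] for c, p in zip(choice, projects))
    initial = sum(c * p[2] for c, p in zip(choice, projects))
    renewables = sum(c * p[3] for c, p in zip(choice, projects))
    if renewables < 0.30 * site_load_mwh:                      # 30% renewable electricity floor
        continue
    if initial > 1000:                                         # initial cost limit ($k)
        continue
    if best is None or lcc < best[0]:
        best = (lcc, choice)

print("selected:", [p[0] for c, p in zip(best[1], projects) if c])
```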

Walker says, “We define big data as a structured database approach to collecting, validating, curating, querying, visualizing, and making available for sharing tremendous amounts of information.”

REopt taps into available data on resources (amount of solar, wind, and biomass resources available at each location in the United States and at many international locations), utility data (residential, commercial and industrial utility rates, policies regarding net metering, and interconnection of distributed generation), cost adjustment factors, and arrangement of available incentives, says Walker.

“All of this relies on the geospatial information systems that the NREL maintains and uses to make site-specific data available to the industry and the public. The required information from our customers, such as the amount of different fuels and electricity used at each site, available land area, among others, also depends on big data if we are examining a large number of sites,” he says.
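At tens of thousands of sites, those customer inputs are assembled programmatically rather than typed in; the sketch below shows one hypothetical way a per-site record might be structured (all field names and values are invented).

```python
# Hypothetical per-site input record; field names and values are invented.
from dataclasses import dataclass

@dataclass
class SiteInputs:
    site_id: str
    latitude: float
    longitude: float
    annual_electricity_kwh: float
    annual_fuel_mmbtu: float
    available_land_m2: float
    utility_rate_usd_per_kwh: float
    net_metering_allowed: bool

site = SiteInputs("tower-00421", 39.74, -105.17, 42_000, 0.0, 850.0, 0.11, True)
print(site)
```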

Reflecting on these advances over traditional approaches, Walker says, “Enterprise-wide planning projects used to be prohibitively expensive and time-consuming before the availability of big data. When we were only examining renewable energy measures at a single site, it was feasible to do it with existing tools. But it requires a big data approach to find the optimal combination out of all the possible permutations of multiple projects at multiple sites for an agency or corporation that operates a whole system of real property.

“For example, we recently were able to deliver recommended sizes for photovoltaic and wind energy projects and other details such as energy, cost, and carbon savings for each one of over 44,000 cell phone towers operated by one of the country’s leading telecommunications companies. Without big data and our automated approach, that would have taken us about 120 years the way we used to do it. But with automated calculations and readily available big data, it took three months to deliver the results, and the company is now entering a pilot to begin implementing the cost-effective projects that the study identified.”

Turning to technology questions, Walker noted, “I invented REO in 2007 based on a ‘search’ approach—an evolutionary algorithm to search the entire domain and then a gradient-reduction algorithm to fine-tune to the optimum.

“Subsequently, my colleague Travis Simpkins and our team coded REopt, which is based instead on a mixed integer linear programming algorithm to deliver optimal recommendations. REopt is a time-series calculation, which allows high-resolution time-series analysis, and enables a detailed treatment of storage (batteries), which is of increasing importance as the cost of storage decreases and the penetration of intermittent renewable energy generation increases,” he says.
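A heavily simplified sketch of that style of formulation is shown below, written with the open-source PuLP library as an assumption of convenience: a binary "install the battery" decision plus an hourly charge/discharge schedule chosen to minimize cost over a four-hour horizon. The prices, loads, and limits are invented; REopt's actual model, data, and solver are far more detailed.

```python
# Much-simplified mixed-integer linear program in the same spirit: decide
# whether to install a battery and how to dispatch it over a short time
# series to minimize cost. All numbers are invented for illustration.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

hours = range(4)
load = [50, 60, 80, 70]            # kW demand each hour (invented)
solar = [30, 70, 40, 10]           # kW PV output each hour (invented)
price = [0.10, 0.10, 0.30, 0.30]   # $/kWh grid price (invented)
battery_cost = 5.0                 # amortized cost of installing the battery, $ (invented)

prob = LpProblem("toy_dispatch", LpMinimize)
build = LpVariable("build_battery", cat="Binary")                    # the integer decision
grid = {t: LpVariable(f"grid_{t}", lowBound=0) for t in hours}       # kW bought from the utility
chg = {t: LpVariable(f"chg_{t}", lowBound=0, upBound=20) for t in hours}
dis = {t: LpVariable(f"dis_{t}", lowBound=0, upBound=20) for t in hours}
soc = {t: LpVariable(f"soc_{t}", lowBound=0, upBound=40) for t in hours}

prob += lpSum(price[t] * grid[t] for t in hours) + battery_cost * build   # total cost to minimize
for t in hours:
    # supply must cover load plus charging; surplus solar is implicitly curtailed
    prob += solar[t] + grid[t] + dis[t] >= load[t] + chg[t]
    # state-of-charge bookkeeping; the battery starts empty
    prob += soc[t] == (soc[t - 1] if t > 0 else 0) + chg[t] - dis[t]
    # no charging or discharging unless the battery is actually installed
    prob += chg[t] <= 20 * build
    prob += dis[t] <= 20 * build

prob.solve()
print("install battery:", int(value(build)), " total cost: $", round(value(prob.objective), 2))
```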

Walker says that the application of big data to the renewable energy industry is “poised to explode.” He explains, “We use big data to identify cost-effective projects, but there is tremendous opportunity around the way assets are operated and managed after they are installed. Significant detail is required to keep increasing the amount of distributed renewable energy on the utility system, and big data is the key to that.

“There will be an evolution of renewable energy systems from those that passively deliver renewable energy whenever it is available to those that actively modulate their output depending on the needs of the larger system and the value of the power.

“There is also a market for the ancillary services that renewable energy systems can deliver such as reactive power (kVAR), voltage support, frequency support, and ultimately reliability of the entire system.

“Sophisticated controls will involve forecasts of the solar and wind resource (weather), forecasts of load requirements, and dispatch of storage (batteries). This will all require very detailed, high-resolution, time-series data enabled by advances in big data,” says Walker.