The continuing trend toward digitalization of data in the lab produces more and more data every year. The drive to “go paperless” is a strategic initiative that offers demonstrable operational benefits in improving productivity, reducing cycle times, and enabling organizations to leverage experimental and operational data generated along the entire research-development-manufacturing continuum. Organizations looking to capture and leverage this data effectively are exploring solutions such as data lakes to help manage the vast quantities of data being generated, and to leverage it in an efficient manner.
The rise of laboratory informatics systems, such as electronic lab notebooks, laboratory information management systems, and laboratory execution systems, has furthered the proliferation of digital information generated by the lab. However, these disparate systems have also created multiple data silos, with data accessible only through proprietary vendor software and stored in incompatible formats. So, although digitization has been achieved, true digitalization (the effective use of digital data for scientific and business purposes) often remains an elusive goal.
The natural reaction is to try to remove the artificial barriers, placing everything in a single common repository, such as a scientific data management system (SDMS), to ease data access. Broadly speaking, an SDMS falls into the category of a data warehouse (or data mart) for scientific data, where the schema for data storage must be predefined—the data is processed and structured based on its anticipated use.
A data lake stands in contrast to a data warehouse in a few key ways. First and foremost, a data lake employs schema-on-read; appropriate processing is applied only when data is queried. To enable this, raw data can be stored in structured, semistructured, or unstructured forms, and is tagged with appropriate metadata and a unique identifier. The architecture is flat and object-based, rather than the hierarchical structure of traditional file systems, making it much easier to scale and manage.
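The schema-on-read pattern can be sketched in a few lines. The in-memory store, field names, and JSON payload below are illustrative assumptions, not a real data-lake API; the point is that raw bytes are stored untouched with metadata and a unique identifier, and structure is imposed only at read time.

```python
import json
import uuid

# Hypothetical flat object store: no hierarchy, no predefined schema.
lake = {}

def ingest(raw_bytes, metadata):
    """Store raw data as-is, tagged with metadata and a generated unique ID."""
    object_id = str(uuid.uuid4())
    lake[object_id] = {"raw": raw_bytes, "metadata": metadata}
    return object_id

def read_as_json(object_id):
    """Apply a schema (here, JSON parsing) only when the data is queried."""
    return json.loads(lake[object_id]["raw"])

oid = ingest(b'{"sample": "A-17", "weight_g": 4.21}',
             {"instrument": "balance-01", "project": "stability"})
record = read_as_json(oid)  # parsing happens on read, not on write
```

A different consumer could read the same stored bytes with a different parser, which is exactly what makes the flat, tagged store reusable for queries its original producer never anticipated.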
By storing large amounts of unstructured data from disparate sources, such systems can process unique and novel queries, combining and processing data as necessary, on request. The schema-on-read distributed-computing aspects of a data lake enable new ways of combining and interpreting data, often independent of the data’s original purpose. It is thus quite common to equate a data lake with “big data,” but they are not the same thing. Big data analytical efforts can be enabled by a data lake, but still require specific goals and processes to be in place, along with strong governance of the projects. Without strong oversight, data may not receive useful metadata tagging and it may still remain effectively siloed from some company stakeholders.
Data management and standardization
Although a data lake exists in large part to store disparate structured and unstructured data in an unsiloed manner, it is still important to consider some standardization of the data heading into the lake itself. In most labs, there is a variety of analytical instruments from different vendors, all creating data in various and often proprietary formats. Analyzing and/or combining the data for any type of analysis is a time-consuming process, often involving manual steps and data transformations.
By parsing analytical data into a common or standardized format, such as those being developed by consortia such as the Pistoia Alliance or Allotrope Foundation, at the time of acquisition, an organization can avoid manual processes and transcription errors. Additionally, standardized data is naturally easier to combine and interpret alongside similar data files in the future.
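Parsing at acquisition time might look like the following sketch. The vendor CSV layout and the common record fields are hypothetical stand-ins, not an actual Allotrope or Pistoia schema; the idea is that the conversion happens once, automatically, instead of manually at each analysis.

```python
import csv
import io
from datetime import datetime, timezone

def parse_vendor_csv(text, instrument_id):
    """Convert a hypothetical vendor CSV export into a common record format,
    attaching contextual metadata (instrument, capture time) as it is parsed."""
    reader = csv.DictReader(io.StringIO(text))
    records = []
    for row in reader:
        records.append({
            "measurement": "weight",                 # canonical term
            "value": float(row["Weight-mass"]),      # vendor-specific column
            "unit": "g",
            "sample_id": row["Sample"],
            "instrument": instrument_id,
            "captured_at": datetime.now(timezone.utc).isoformat(),
        })
    return records

vendor_export = "Sample,Weight-mass\nA-17,4.21\nA-18,4.35\n"
standardized = parse_vendor_csv(vendor_export, "balance-01")
```

Because every instrument's output funnels through a parser like this, downstream consumers only ever see one record shape, which is what makes later combination and comparison cheap.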
During the parsing process, relevant metadata can be applied to the analytical data as well. Metadata helps add context, such as the experimental conditions, equipment used for capture, project parameters, users and materials involved, and other relevant information to help identify the data for future use. The use of metadata tags is extremely powerful for analysis, but care must be taken to properly define the terms and variables used; misuse or undisciplined usage across an organization can render advanced data analysis incomplete or make results misleading.
At the very least, the vocabulary of the metadata needs to be properly defined, so that terms are uniformly applied to analytical data when it is captured and parsed. For example, using only “weight” rather than a mix of “weight (g),” “Weight-mass,” or “mass” (among other variations) makes the data much easier to cross-reference in a search. After the common vocabulary is defined, the taxonomy and ontology for data sets should also be defined. The taxonomy is a hierarchy that defines how the various vocabulary terms relate to each other—in our example, the unit (grams) would be a related subset of the “weight” tag. The ontology is a controlled vocabulary in which both the hierarchical taxonomy relationships and the relationships between hierarchical branches are defined—an ontology describes how the tags interrelate, and allows unique and unexpected queries to be performed. A weight measurement from a certain instrument has a specific location, and is normally used by certain individuals for a certain subset of available tasks—these relationships can be defined by the ontology.
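Enforcing the controlled vocabulary at ingest can be as simple as a lookup that maps the variants named above onto one canonical tag and rejects anything unknown. The mapping table below is a minimal illustrative sketch, not a full ontology:

```python
# Hypothetical controlled vocabulary: every accepted variant maps to its
# single canonical term.
CANONICAL_TERMS = {
    "weight": "weight",
    "weight (g)": "weight",
    "weight-mass": "weight",
    "mass": "weight",
}

def normalize_tag(raw_tag):
    """Return the canonical vocabulary term for a metadata tag. Unknown terms
    raise immediately, so undisciplined usage is caught at capture time
    rather than surfacing later as a misleading analysis."""
    key = raw_tag.strip().lower()
    if key not in CANONICAL_TERMS:
        raise ValueError(f"unknown metadata term: {raw_tag!r}")
    return CANONICAL_TERMS[key]
```

Failing fast on an unrecognized term is the design choice that keeps the lake's tags trustworthy: a silent pass-through would quietly reintroduce the very synonym sprawl the vocabulary exists to prevent.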
With properly defined metadata tagging and data parsed into standardized formats, the data can be automatically captured and fed directly into an enterprise data lake. The process of storing, leveraging, and reusing data becomes more efficient, more powerful, and more accurate. At the same time, the data lake helps enable advanced real-time analytics and multifactor trend analysis: because capture and storage are automated and the data is accessible across the organization, results are instantly available for analysis.
Organizations are still learning how to best leverage all the data that is being made available from their digitalization efforts. But it is clear that the multitude of data, along with the context provided by metadata, provides ample opportunity for new, advanced analytics efforts and data visualization. At a glance, anyone can investigate resource use (materials, instruments, lab space), monitor lab activity to identify analytical bottlenecks, or track instrument performance and predict maintenance and calibration needs. Additionally, the contextualized data and the relationships defined by the ontologies facilitate multifactor analytics, enabling manufacturing trendline analysis and detailed batch and quality analysis.
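With consistent metadata in place, an at-a-glance question like "which instrument is the bottleneck?" reduces to a simple aggregation over tagged records. The records and field names here are invented for illustration:

```python
from collections import Counter

# Illustrative metadata-tagged run records pulled from the lake
# (field names are assumptions, not a real schema).
records = [
    {"instrument": "hplc-01", "duration_min": 42},
    {"instrument": "hplc-01", "duration_min": 44},
    {"instrument": "hplc-02", "duration_min": 12},
]

# Run count per instrument: a heavily skewed count can flag an
# analytical bottleneck or an upcoming maintenance need.
usage = Counter(r["instrument"] for r in records)
busiest, runs = usage.most_common(1)[0]
```

The same pattern extends to any tag the ontology defines (project, user, material), which is what turns uniform metadata into dashboard-ready analytics.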
As with many lab informatics solutions, the current trend is also toward the cloud. In some cases, the entirety of the project can be hosted on the cloud, from the data-lake storage to the lab informatics and analytics applications, but a popular option is on-premises data storage with the computing portion hosted as a cloud-based service. This arrangement leverages more cost-effective data storage and the power of purpose-built, parallel-processing systems on demand from the cloud. The well-known data-lake infrastructure Hadoop, with its wide range of modules that can be tricky to configure, has started to fall out of favor in the face of more out-of-the-box cloud-based solutions. Existing cloud vendors are well positioned to provide either the heavy lifting of analytic computing power, or a complete cloud-based data-lake solution.
As organizations continue to further their digitalization efforts, it is clear that finding ways to most efficiently leverage the data and knowledge generated internally is a strong driving force. As some barriers to data access are removed through solutions such as data lakes, others are unwittingly erected, with siloed access to organizational analytics data and dashboards. Unlocking and leveraging the analytics aspects of a business, perhaps in connection with the move toward the cloud, will allow the true power of big data to be realized.