Realizing the Value of Diverse and Voluminous Data Requires Effective Data Management, Institutional Arrangements and Policies
Lab managers will face a growing challenge in the next decade: managing a rising tide of data. The amount of data generated annually is forecast to double every two years over the next decade as the cost of computing and networking declines.1 Laboratories will contribute to this data flood as the number of people and data-generating instruments connected to the Internet increases. George Gilder referred to this phenomenon as the “exaflood.” (Gilder is a senior fellow at the Discovery Institute and chairman of George Gilder Fund Management, LLC.) An exabyte is equal to one billion gigabytes, or approximately 50,000 times the contents of the U.S. Library of Congress. From the laboratory perspective, some research projects routinely generate terabytes and even petabytes of data. (A terabyte is one trillion bytes; a petabyte is 1,000 terabytes.) Many other projects produce smaller, heterogeneous collections with valuable attributes. Realizing the full benefit and value of these diverse and voluminous data requires effective data management techniques, institutional arrangements, and policies.
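The storage units mentioned above are simple powers of 1,000, so the relationships can be checked with a few lines of arithmetic. The sketch below (in Python, purely for illustration) encodes the definitions:

```python
# Byte-scale units discussed above, defined as powers of 1,000
# (decimal prefixes, as used in the article).
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}

# An exabyte is one billion gigabytes:
print(UNITS["exabyte"] // UNITS["gigabyte"])   # 1000000000

# A petabyte is 1,000 terabytes:
print(UNITS["petabyte"] // UNITS["terabyte"])  # 1000
```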
Factors promoting the exaflood
From the laboratory perspective, factors promoting the exaflood include the globalization of research, open innovation and the increasing use of supercomputers. Global corporations perform research in countries scattered across the globe. For example, Dow Chemical, the largest U.S. chemical maker, has research facilities in the U.S., China, India, Saudi Arabia and other countries. Global corporations often have shared research projects performed at facilities thousands of miles apart. Open-innovation projects performed at two or more organizations require sharing data and reports.2 U.S. federal laboratories license technology to the private sector. These and other activities contribute to the increasing volume of scientific and engineering data moving over the Internet.
IBM Roadrunner Supercomputer. Photograph courtesy of Los Alamos National Laboratory.
Supercomputers enable the creation and processing of large volumes of data. A directory of supercomputing resources available on the Internet is located at http://userpages.umbc.edu/~jack/supercomputer-resources.html. The federal government’s national laboratories, including Oak Ridge National Laboratory and Sandia National Laboratories, are home to supercomputers that can perform astonishing numbers of calculations at amazing speed. For example, Oak Ridge’s Jaguar, a Cray supercomputer, often achieves sustained performance of over a petaflop (a quadrillion mathematical calculations per second).
National Oceanic and Atmospheric Administration (NOAA) Supercomputer. Photograph courtesy of NOAA.
Some universities operate supercomputing centers that provide services not only to their own researchers but also to researchers across the country and even worldwide. For example, the University of California, San Diego, operates the San Diego Supercomputing Center. Academic scientists and engineers nationwide use its computing equipment and the services of its 250 staff members to work on a wide variety of problems. These include cardiovascular disease, protein structure, molecular dynamics, materials flow and mixing, earthquakes, and mathematics. Increasingly, supercomputing is being used in the social sciences and liberal arts. Environmental and climate studies require the processing of enormous amounts of data.
These large volumes of data and calculation results are increasingly moved over the Internet. Supercomputing centers and the Internet make it possible for relatively small research centers and research groups to avoid the expense of purchasing their own supercomputers and hiring information technology administrative staff. For example, the Open Science Grid, associated with Argonne National Laboratory, promotes discovery and collaboration in data-intensive research by providing a computing facility and services that integrate distributed, reliable and shared resources to support computation at all scales.
The largely Texas-based Petroleum Engineering Grid (PEGrid) is another example of an open-grid project. According to senior scientist Alan Sill of Texas Tech University, it began in late 2009 with the objective of sharing best practices between academia and industry. Because oil and gas companies and the firms that collect seismic and other oilfield data have spent huge sums to generate it, data security is a major worry for them. These concerns discourage many companies from exploring grid technologies. According to Sill, one reason is that many firms use data security methods that are outdated compared with the best practices of grid computing. PEGrid provides a means for these firms to become familiar with grid computing security practices. As they do so, they are more likely to employ grid computing techniques extensively.
Cloud computing is a trendy way of describing web-accessible data storage and processing.3 It offers an efficient way for organizations to deliver and use information technology infrastructure, applications and services. Examples of data stored in the cloud include your product recommendations from Amazon.com based on your previous purchases. Social networks such as Facebook and professional networking sites such as LinkedIn save information about you in the cloud where others can retrieve it. Photo-sharing sites store images in the cloud. Web-based e-mail programs keep messages in the cloud. People also are starting to back up the contents of their computers to the cloud. This makes files accessible from almost anywhere via an Internet connection.
Protecting proprietary information and preventing unauthorized access to data stored in a public cloud are serious concerns. Consequently, many organizations are reluctant to store proprietary and personal data anywhere other than on their own computer systems. Private data clouds exist on an organization’s own computer system, protected by corporate firewalls. They enable organizations to decrease capital costs by reducing the number of servers, minicomputers and mainframe computers they need, since these resources are shared by people adding to and processing data in the cloud.
Technologies promoting the data flood
Several technologies in particular are promoting the flood of scientific data needing to be transmitted and processed. The discovery and evaluation of oil and gas reservoirs have long required powerful computers and processing of very large amounts of data. These requirements have intensified as the industry has increasingly explored for hydrocarbons under ocean waters. Offshore oil and gas exploration requires “immense levels of information processing,” according to Robin West, chairman of consulting firm PFC Energy.4 Seismic imaging is the key technology in discovering these oil and gas fields. This technology has become steadily more sophisticated, requiring increased data-processing capacity. As the industry explores for oil in deeper rock formations at greater ocean depths, large oilfields are being discovered under salt layers. These thick layers disrupt sound waves, blurring the seismic images. Sharpening these images requires the use of supercomputers.
The U.S. Food and Drug Administration requires the pharmaceutical industry to collect and report increasingly large volumes of clinical trial data. Often these trials are conducted in other countries such as India. The amount of data required to support new drug applications is growing rapidly.
Another field with huge data generation and processing requirements is weather forecasting. Both local weather data stations and satellites collect and transmit large volumes of data for analysis. Since the December 26, 2004, tsunami tragedy in the Indian Ocean that killed over 200,000 people, networks of detectors have been placed in the ocean at many locations to detect tsunamis and predict their size and course, permitting timely evacuations of coastal areas. For example, just three hours after the February 27, 2010, earthquake in Chile, a sensor on the ocean floor 205 miles from the epicenter transmitted the first evidence that a tsunami was headed across the Pacific Ocean toward Hawaii.5 Scientists at the National Oceanic and Atmospheric Administration’s Pacific Tsunami Warning Center calculated the course of the tsunami and how powerful it would be when it hit coastlines. Emergency sirens gave coastal residents more than ten hours to evacuate before destructive waves washed ashore. As a result, there were no significant injuries or major damage.
Speaking at the AAAS annual meeting, Hal Varian, on the faculty at the University of California, Berkeley, and chief economist for Google, Inc., described a data analysis tool called Google Insights for Search. It permits correlating economic data with the number and distribution of individual Google searches and using those searches to make predictions. For example, by mining past data on Google searches for information on Toyota automobiles and correlating them with Toyota sales, Varian was able to predict current sales volume with reasonable accuracy before industry sales data was released.
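The "nowcasting" idea Varian describes can be illustrated with a toy example. The sketch below is not Google's actual method; the one-variable linear model and all numbers are invented purely to show how a search-volume index might be used to estimate sales before official figures arrive:

```python
# Toy nowcasting sketch: fit ordinary least squares relating a
# search-volume index to reported sales, then apply the fitted line
# to the current (immediately available) search index.
# All numbers are invented for illustration.
search_index = [40, 55, 60, 50, 70]     # past months' search volume
reported_sales = [18, 24, 26, 22, 30]   # sales, thousands of units

n = len(search_index)
mean_x = sum(search_index) / n
mean_y = sum(reported_sales) / n

# Ordinary least squares: sales ~ slope * index + intercept
slope = (sum((x - mean_x) * (y - mean_y)
             for x, y in zip(search_index, reported_sales))
         / sum((x - mean_x) ** 2 for x in search_index))
intercept = mean_y - slope * mean_x

current_index = 65   # this month's search volume, known right away
estimate = slope * current_index + intercept
print(f"estimated sales: {estimate:.1f} thousand units")
```

With these made-up figures the fitted line is sales = 0.4 × index + 2, so a current index of 65 yields an estimate of 28.0 thousand units; a real application would use many predictors and proper out-of-sample validation.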
Research in virtually all fields of science is increasingly performed by researchers scattered around the globe who both generate and process growing amounts of numeric and image data. For example, “metagenomics” is a relatively new field of research that uses the latest developments in DNA sequencing technologies to study genetic material recovered directly from environmental samples. The vast majority of life on earth is microbial. Previously, most research on microbes depended on the preparation of laboratory cultures. However, since 99 percent of microbes cannot be cultured, it is only recently that modern genetic sequencing techniques have allowed determination of the hundreds to thousands of microbial species present at any given environmental location.
The amount of data defining the metagenomics of these microbial ecologies is growing exponentially as researchers acquire next-generation sequencing devices, said Larry Smarr of the University of California, San Diego, who spoke at the 2010 annual AAAS meeting. As an example, he cited Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), funded by the Gordon and Betty Moore Foundation. Its data repository at the University of California, San Diego, presently contains over 500 microbial metagenomics datasets plus the full genomes of approximately 166 marine microbes. Over 3,000 registered end users from 70 countries can access existing data and contribute new metagenomics data via the Internet.
The increasing number of scientific research space probes launched by NASA and other countries’ space agencies requires both supercomputer power and the transmission of data around the globe for analysis and study. Climate studies require supercomputing power to perform the complex simulations needed to study global warming and other phenomena.
According to Les Cottrell, director of computer services for the Stanford Linear Accelerator Center (SLAC), high-energy physics has become a global enterprise, with data needing to be available to many sites around the world for analysis. SLAC and other facilities are generating huge amounts of data that must be transmitted over great distances. Data is flowing into, as well as out of, the United States. For example, CERN’s Large Hadron Collider in Switzerland is providing enormous amounts of data to U.S. researchers.6
Challenges and opportunities
Several unresolved issues face lab managers. How can research organizations ensure that data are properly archived and made available to all the researchers who might find them useful? How can the origin and accuracy of data be ensured and properly documented? What new approaches are needed to ensure that the preservation of, and access to, data of great scientific and social value is a priority?
Despite Greer’s concern and the challenges raised in the opening paragraph of this article, symposium attendees were excited about the new possibilities for R&D offered by the exaflood. They felt that laboratories would surf the flood of data rather than be overwhelmed by it.
1. B. Swanson and G. Gilder, “Estimating the Exaflood: The Impact of Video and Rich Media on the Internet—‘A Zetabyte’ of Data by 2015?” Discovery Institute Report (January 29, 2008), http://www.discovery.org/a/4428.
2. J.K. Borchardt, “Open Innovation Becoming Key to R&D Success,” Lab Manager Magazine (January 31, 2008), http://www.labmanager.com/articles.asp?ID=28.
3. K. Boehret, “Get Your Storage out of the Cloud,” Wall Street Journal (March 1, 2010), http://online.wsj.com/article/SB40001424052748704188104575083533949634468.html.
4. B. Casselman and G. Chazan, “Cramped on Land, Big Oil Bets at Sea,” Wall Street Journal (January 7, 2010), http://online.wsj.com/article/SB126264987791815617.html.
5. D. Leinwand, “Tsunami Alerts Help Save Lives,” USA Today (March 1, 2010), http://www.usatoday.com/tech/science/2010-03-01-tsunami-alerts_N.htm.
6. K. Dean, “Data Flood Speeds Need for Speed,” Wired (February 13, 2003), http://www.wired.com/techbiz/it/news/2003/02/57625.