
Researchers Optimizing End-To-End Movement of 'Big Data'

Finding ways to deal with “big data,” which is defined as data sets too large and complex for both traditional computers and average network throughput to handle, has become a science in itself

by Clemson University

Video credit: Clemson University


CLEMSON, South Carolina — Today’s scientists are riding an unprecedented wave of discovery, but the immensity of the data needed to facilitate many of these breakthroughs is creating internet roadblocks that are becoming increasingly detrimental to research.

Finding ways to deal with “big data,” which is defined as data sets too large and complex for both traditional computers and average network throughput to handle, has become a science in itself.

But with an eye to the future, Clemson University researchers are playing a leading role in developing state-of-the-art methods to transfer these enormous datasets from place to place using the 100 gigabit Ethernet Internet2 Network. Owned by the nation’s leading higher education institutions, the advanced Internet2 Network is the nation’s largest and fastest coast-to-coast research and education infrastructure designed for next-generation scientific collaboration and big data transfer.

Internet2 operates the nation’s largest and fastest coast-to-coast research and education network. Image credit: Courtesy of Internet2

“We’ve leveraged advanced research networks from Internet2 and parallel file system technologies to choose the optimal ways to send and receive massive data sets around the country and world,” said Alex Feltus, associate professor in genetics and biochemistry in Clemson University’s College of Science. “What used to take days now takes hours—or even less. And these same methods apply to any project that uses large, contemporary data sets.”

Genomics research is rapidly becoming one of the leading generators of big data for science, with the potential to equal if not surpass the data output of the high-energy physics community. Like physicists, university-based life-science researchers must collaborate with counterparts and access data repositories across the nation and around the globe.

“Researchers who work with large data sets need a reliable and agile network that allows them to accelerate data transfers among collaborators,” said Rob Vietzke, Internet2 vice president for network services. “Alex’s work is a great example of how our community members come together to solve research problems that facilitate and enable scientific collaboration on a global scale.”

As a key component of their research, Feltus and his collaborators have developed an open-source, freely available software package called Big Data Smart Socket (BDSS) that takes a user’s request for data and attempts to rewrite it in a more optimal way, creating the potential to send and receive data at much higher speeds.
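To make the idea concrete, here is a minimal, hypothetical sketch of what “rewriting a data request” can look like. It is not the actual BDSS code: the mirror table, host names, and function names below are invented for illustration, and the real BDSS software makes its decisions using its own metadata about repositories and transfer tools.

```python
# Hypothetical sketch of request rewriting, loosely inspired by the BDSS idea.
# The mirror table, host names, and transfer-tool labels are invented for
# illustration; they are not part of the real BDSS software.

from urllib.parse import urlparse

# Map a repository host to a (hypothetical) faster mirror and preferred tool.
MIRROR_TABLE = {
    "ftp.ncbi.nlm.nih.gov": {
        "mirror_host": "mirror.example-research-net.org",  # invented mirror
        "transfer_tool": "parallel-http",                   # e.g., many streams
    },
}

def rewrite_request(url: str) -> dict:
    """Return a possibly rewritten transfer plan for a data URL."""
    parsed = urlparse(url)
    rule = MIRROR_TABLE.get(parsed.netloc)
    if rule is None:
        # No better route known: fall back to the original request.
        return {"url": url, "transfer_tool": "default"}
    rewritten = parsed._replace(netloc=rule["mirror_host"]).geturl()
    return {"url": rewritten, "transfer_tool": rule["transfer_tool"]}

print(rewrite_request("https://ftp.ncbi.nlm.nih.gov/sra/example_dataset.fastq.gz"))
```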

“We’ve found the right buffer size, number of parallel data streams and the optimal parallel file system to perform the transfers,” said Feltus, who is director of the Clemson Systems Genetics Lab. “It’s very important that end-to-end data movement—and not just network speed—is optimized. Otherwise, bottlenecks on the sending or receiving side can slow transfers to a crawl. Our BDSS software enables researchers to receive data—optimized for the architecture of their own computer systems—far more quickly than before. Previously, researchers were having to move rivers of information through small pipes at the sending and receiving ends. Now, we’ve enhanced those pipes, which vastly improves information flow.”
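As a rough illustration of what tuning stream counts and buffer sizes means in practice, the sketch below pulls one file in several concurrent byte-range streams with an adjustable read buffer. The URL, stream count, and buffer size are placeholder values rather than settings recommended by the Clemson team, and production tools on Internet2 paths operate at far larger scales.

```python
# Illustrative only: fetch one file in N parallel byte-range streams.
# URL, NUM_STREAMS, and BUFFER_SIZE are placeholder tuning knobs.

import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://example.org/large_dataset.bin"   # placeholder data set
NUM_STREAMS = 8                  # parallel data streams (tunable)
BUFFER_SIZE = 4 * 1024 * 1024    # 4 MiB read buffer per stream (tunable)

def fetch_range(start: int, end: int) -> bytes:
    """Fetch bytes [start, end] of the file over one stream."""
    req = urllib.request.Request(URL, headers={"Range": f"bytes={start}-{end}"})
    chunks = []
    with urllib.request.urlopen(req) as resp:
        while True:
            chunk = resp.read(BUFFER_SIZE)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)

def parallel_download() -> bytes:
    # Ask the server for the total size, then split it into equal ranges.
    head = urllib.request.Request(URL, method="HEAD")
    with urllib.request.urlopen(head) as resp:
        total = int(resp.headers["Content-Length"])
    step = total // NUM_STREAMS
    ranges = [(i * step,
               total - 1 if i == NUM_STREAMS - 1 else (i + 1) * step - 1)
              for i in range(NUM_STREAMS)]
    with ThreadPoolExecutor(max_workers=NUM_STREAMS) as pool:
        parts = list(pool.map(lambda r: fetch_range(*r), ranges))
    return b"".join(parts)
```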

The ever-expanding sizes of data sets used in scientific computing have made it increasingly difficult for researchers to store all their data at their primary computing sites. Therefore, it has become crucial for researchers to be able to transfer a segment of data from geographically distant repositories and—after analyzing it—to delete it to make room for new data needed to continue the advancement of the project.

Data sets stored in the publicly funded National Center for Biotechnology Information (NCBI) repository in Bethesda, Maryland, have grown by 8.7 quadrillion DNA base pairs since 2008. Image credit: Courtesy of NCBI

BDSS enables researchers, many of whom are unaware of available technologies, to perform faster and more efficient transfers. The groundbreaking software takes advantage of specialized infrastructure such as parallel file systems, which distribute data across multiple servers, and advanced software-defined networks, which allow administrators to build, tune and curate groups of researchers into a virtual organization.
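For readers unfamiliar with parallel file systems, the toy sketch below shows the basic striping idea: a file is cut into fixed-size stripes spread round-robin across several storage servers, so many servers can read or write it at once. The stripe size and server count are arbitrary example values; real parallel file systems such as Lustre or OrangeFS implement this at the file system layer.

```python
# Toy illustration of round-robin striping across storage servers.
# STRIPE_SIZE and NUM_SERVERS are arbitrary example values.

STRIPE_SIZE = 1024 * 1024   # 1 MiB per stripe
NUM_SERVERS = 4             # storage servers holding the stripes

def stripe_layout(file_size: int) -> dict:
    """Map each storage server to the stripe indices it holds."""
    layout = {server: [] for server in range(NUM_SERVERS)}
    num_stripes = (file_size + STRIPE_SIZE - 1) // STRIPE_SIZE
    for stripe in range(num_stripes):
        layout[stripe % NUM_SERVERS].append(stripe)
    return layout

# A 10 MiB file becomes 10 stripes spread over 4 servers, so all four
# servers can stream their pieces of the file in parallel.
print(stripe_layout(10 * 1024 * 1024))
```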

“Network engineers have made these gigantic pipes through Internet2 that can transfer data at a hundred gigabits per second. That’s a hundred billion bits of data, which is almost unfathomable,” Feltus said. “In order to maximize the fastest pipes ever seen in human history, the entire system must be optimized to be able to receive the data from these pipes. Our focus is figuring out ways to transfer data in parallel streams that match the number of hard drives that are receiving the data.”

Feltus collaborates with Clemson Computing and Information Technology, as well as faculty from multiple institutions around the nation, to maximize data transfer through a next-generation campus network linked to the Internet2 backbone. He and his partners have published two recent papers on maximizing big data transfer:

  • “Big Data Smart Socket (BDSS): A System that Abstracts Data Transfer Habits from End Users” (Nick Watts and Alex Feltus; published in Bioinformatics). This work was supported by “Triple Gateway: Platform for Next Generation Data Analysis and Sharing,” funded by a grant from the National Science Foundation with Washington State University’s Stephen Ficklin as principal investigator and Feltus as co-principal investigator. Other collaborators included scientists from the universities of Tennessee and Connecticut.

“Think of what we’re doing as sort of a shipping service at Christmas time,” Feltus concluded. “We’re all trying to move a lot of stuff around at the same time. But it might be more efficient to use Interstate 285 and go around Atlanta rather than drive through downtown. Or maybe use an airplane instead of a truck. We’re trying to get the data there as quickly as possible so that the customer is happy.”