Scientists Help Tame Tidal Wave of Genomic Data Using SDSC’s Trestles

Sequencing the DNA of an organism, whether human, plant, or jellyfish, has become a straightforward task, but assembling the information gathered into something coherent remains a massive data challenge. Researchers using computational resources at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, have created a faster and more effective way to assemble genomic information, while increasing performance.

Written byLab Manager

| 3 min read

Listen with Speechify

0:00

3:00

Sequencing the DNA of an organism, whether human, plant, or jellyfish, has become a straightforward task, but assembling the information gathered into something coherent remains a massive data challenge. Researchers using computational resources at the San Diego Supercomputer Center (SDSC) at the University of California, San Diego, have created a faster and more effective way to assemble genomic information, while increasing performance.

To continue reading this article, sign up for FREE to

Membership is FREE and provides you with instant access to eNewsletters, digital publications, article archives, and more.

Unlock for FREE

Add Lab Manager as a preferred Google source to see more of our trusted coverage.

Related Topics

In a paper presented the past month at the 39th International Conference on Very Large Databases (VLDB2013) in Riva del Garda, Italy, Xifeng Yan, the Venkatesh Narayanamurti Chair of Computer Science at the University of California, Santa Barbara, explains how he used SDSC’s Trestles compute cluster to help develop a new algorithm called MSP (minimum substring partitioning) that helps to assemble genomes with extreme efficiency. MSP is a critical part of a pipeline, or a group of software that assembles entire genomes, with each piece of the software doing one part of the job. Yan and his colleagues were able to optimize one of two steps to use a mere 10 gigabytes of memory without runtime slowdown.

Xifeng Yan, UC Santa Barbara. Photo courtesy of University of California, San Diego

“High-quality genome sequencing is foundational to many critical biological and medical problems,” said Yan. “With the advent of massively parallel DNA sequencing technologies how to manage and process the big sequence data has become an important issue. Experimental results showed that MSP can not only successfully complete the tasks on very large datasets within a small amount of memory, but also achieve better performance than existing state-of-the-art algorithms.”

According to Yan, his experimental results demonstrate that MSP’s improvement in efficiency might soon make it possible to assemble large genomes using smaller, less expensive, commodity clusters rather than requiring high-cost, high performance resources.

Knowing the whole genome of various species underlies biological and medical research, such as understanding evolution pathways or identifying the cause of diseases. However, existing sequencing techniques produce huge amounts – billions for a higher organism such as a human – of overlapping short sequence randomly sampled from the genome. A major challenge in genome research is to assemble those short reads, which vary from ten to several hundred bases, back into the whole genome, a task that requires vast amounts of memory. It would be similar to gluing together an encyclopedia from a haystack of words and sentence fragments.

Using Trestles, Yan and his colleagues demonstrated that MSP reduces one of the steps required so that it uses significantly less memory than widely-used algorithms, removing one of the bottlenecks in processing whole genomes. Algorithms such as Velvet and SOAPdenovo struggle to computationally to prepare a virtual scaffolding upon which to assemble the sequence into complete genomes. MSP, a disk-based partition method, streamlines the creation of such scaffolding, known as a De Bruijn graph. A mammalian-sized genome processed using other algorithms would consume hundreds of gigabytes of memory, while MSP allows researchers to complete a key step to ten gigabytes of memory without runtime slowdown.

Yan and his colleagues are working on a second step that also consumes a significant amount of memory, and have so far reduced its memory use by two-thirds with the goals of further reductions in the future. Additional researchers include Yang Li, Pegah Kamousi, Fangqiu Han, Shengqi Yang, and Subhash Suri, all with UC Santa Barbara. The full paper can be viewed at http://www.cs.ucsb.edu/~xyan/papers/vldb13_debruijn.pdf.

Scientists Help Tame Tidal Wave of Genomic Data Using SDSC’s Trestles

Related Topics

When Lab Innovation Meets Sustainability

Sponsored

The Decisions That Define Your Lab’s Environmental Footprint

From Instruments to Infrastructure: Implementing and Scaling Lab Automation

Stop Blaming Pure Water for Stainless Steel Corrosion

In Step with Scott: Demystifying How Lab Managers Decide