Handling the Exponential Growth in Raw and Processed Genetic Data
Gene sequencing is all about data: roughly 3.2 gigabytes for a single human genome, and several times that again to make raw sequences relevant to real-world problems.
Mining the genome for medical intelligence multiplies the data “crunch” for gene sequencers and value-added services that annotate gene sequences for their relevance to protein and metabolite concentrations, and to both diseased and healthy states.
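To put those multiples in perspective, the back-of-the-envelope sketch below estimates storage per genome in Python. The coverage depth, bytes per base, and downstream multiplier are illustrative assumptions, not figures supplied by the companies quoted in this article.

```python
# Back-of-the-envelope estimate of per-genome data volume.
# All constants below are illustrative assumptions, not vendor figures.

GENOME_BASES = 3.2e9        # approximate size of the human genome, in bases
COVERAGE = 30               # assumed whole-genome sequencing depth
BYTES_PER_BASE = 2          # base call plus quality score, ignoring headers/compression
DOWNSTREAM_MULTIPLIER = 2   # rough allowance for alignment and annotation outputs

raw_bytes = GENOME_BASES * COVERAGE * BYTES_PER_BASE
total_bytes = raw_bytes * DOWNSTREAM_MULTIPLIER

print(f"Raw reads at {COVERAGE}x coverage: ~{raw_bytes / 1e9:.0f} GB per genome")
print(f"With downstream files:            ~{total_bytes / 1e9:.0f} GB per genome")
```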
Massively parallel, high-throughput sequencing has raised the data-handling ante significantly. Traditionally, sequencing labs have expanded their data-handling capabilities by purchasing and networking more “boxes”—central processing units (CPUs) and storage. But the sheer volume of data has created the need for an alphabet soup of advanced computing architectures, such as compute unified device architecture (CUDA)-enabled graphics processing units (GPUs), general-purpose (GP) processor combinations, multicore CPUs, and GPU-CPU clusters.
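One rung on that ladder, the multicore CPU, can be illustrated with a minimal Python sketch that spreads a per-read calculation across available cores. The reads and the GC-content task are invented purely for the example; they stand in for whatever per-read work a real pipeline performs.

```python
from multiprocessing import Pool

def gc_fraction(read: str) -> float:
    """Fraction of G/C bases in one read (a stand-in for any per-read computation)."""
    return (read.count("G") + read.count("C")) / len(read)

if __name__ == "__main__":
    # Invented reads in place of a real FASTQ file, to keep the sketch self-contained.
    reads = ["ACGTACGTGG", "TTTTACGCGC", "GGGGCCCCAA", "ATATATATAT"] * 250_000

    with Pool() as pool:  # one worker per available CPU core by default
        fractions = pool.map(gc_fraction, reads, chunksize=10_000)

    print(f"Mean GC fraction over {len(reads):,} reads: {sum(fractions) / len(fractions):.3f}")
```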
Hybrid-Core Approaches
Another approach, generally referred to as hybrid-core or heterogeneous computing, employs different processing units (GPUs, CPUs, co-processors) and diverts computational tasks to the “correct” unit. “Heterogeneous computing involves different computing resources where they are most appropriate,” explains George Vacek, Ph.D., life sciences director at Convey Computer (Richardson, TX), a computer manufacturer that serves the life science and other industries.
Convey’s hybrid core consists of a standard Intel core integrated with a co-processor. An application’s key instructions can then be off-loaded from the core to the co-processor, speeding up the overall application.
“Hybrid-core architecture is really good at pattern matching, graph analytics, and executing algorithms used in bioinformatics and next-generation sequencing,” Vacek says.
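The pattern matching Vacek describes often boils down to scanning enormous numbers of reads for short motifs. The unaccelerated Python sketch below, with a made-up motif and reads, shows the kind of inner loop that hybrid-core and other specialized hardware is built to speed up; it is illustrative only, not Convey’s implementation.

```python
def count_motif_hits(reads, motif):
    """Count reads containing a motif and total occurrences across all reads."""
    reads_with_hit = 0
    total_hits = 0
    for read in reads:
        hits = read.count(motif)   # naive exact matching; production pipelines use
        if hits:                   # indexed or hardware-accelerated search instead
            reads_with_hit += 1
            total_hits += hits
    return reads_with_hit, total_hits

# Motif and reads are invented for the sketch.
reads = ["ACGTGATTACAGG", "GATTACAGATTACA", "CCCCCCCCCCCCC"]
print(count_motif_hits(reads, "GATTACA"))   # (2, 3): two reads hit, three occurrences
```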
“Next-generation or short-read sequencing generates lots and lots of data. The rate at which sequencing is improving on a dollar basis is expanding exponentially, even for small or very modest facilities,” Dr. Vacek explains. “The challenge, if you experience a three-order-of-magnitude increase in the amount of data you generate, is you can’t have a thousand-fold increase in the size of your server or data center.”
Sequencing labs experiencing data woes run into limits of space and budget, along with the management burdens of power, cooling, and cabling. Hybrid-core computing replaces between five and 25 traditional servers with a single server maintained in Convey’s facilities, without the associated expense and logistical issues.
Convey says it can replace ten servers for reference mapping of next-generation sequencing with one server. “Think of this as the same effective throughput with one-tenth the equipment, or finishing a job in one-tenth the time,” Vacek says. “That’s a huge cost savings for power, cooling, and cabling. And if you have ten times as many systems, even if they are fairly stable, the laws of statistics tell you there will be ten times as many server failures, not to mention replacement. So there are some real savings in operating costs.”
“Once we could afford whole-genome sequencing, we found a significant bottleneck in the time required to process the data,” said Laura Reinholdt, Ph.D., a research scientist at The Jackson Laboratory (JAX) located in Bar Harbor, ME. “That’s when biologists here began to seek tools and infrastructures to more expediently manage and process the expanding volumes of NGS [next generation sequencing] data.” To solve this problem, JAX sought heterogeneous computing to complement its existing compute clusters.
In biology research, higher-performing informatics means more than simply completing the job faster. “It brings other positives to research; for example, you can try less-approximative methods you wouldn’t have tried before, or larger data sets, or attack larger problems that were impractical or literally impossible,” Vacek tells Lab Manager Magazine. “The real advantage is driving new areas of science and achieving higher-quality research.”
“We’ve had assemblies we couldn’t complete on our 256-node cluster simply because they were taking too long,” said Dr. Guilherme Oliveira, president of the Brazilian Association for Bioinformatics and Computational Biology and a member of the Board of the International Society for Computational Biology. “We evaluated several platforms and are excited to be working with a hybrid-core system.”
“Downstream” Data Management
The data crunch in sequencing stems in significant degree from the precipitous drop in the cost of whole-genome mapping, a drop driven largely by personalized medicine. Personalized medicine seeks to differentiate, on the basis of genes or other biomarkers, subsets of diseases like cancer. Personalization is based on the observation that genetic differences may signal radically different prognoses and suggest differing approaches to treatment for individuals with, for example, prostate cancer.
The relationship between Moore’s Law and the cost of sequencing a genome says a lot about how the two disciplines are linked. Moore’s Law states that the complexity of integrated circuits doubles roughly every 18 months. Between the late 1990s and 2008, the cost of sequencing a genome halved on approximately the same 18-month schedule.
The widespread adoption of second-generation sequencing in 2008 changed everything. The progression beyond traditional automated Sanger sequencing—the technology responsible for the Human Genome Project—to second-generation techniques had the immediate effect of accelerating the drop in cost per genome from a Moore’s Law-type relationship to a halving every two to three months. The imminent adoption of third-generation sequencing could bring about the much-anticipated “thousand-dollar genome” within a few years, according to Hank Wu, director of translational informatics at Remedy Informatics (Sandy, UT). Remedy provides software for connecting genomics with scientific and clinical data to enable “translational” research connecting basic R&D to clinical practice. Among the company’s offerings are software platforms for enabling the basic research on which personalized medicine depends. “Personalized medicine, which depends so heavily on genomics, is now pushing toward clinical relevance,” Mr. Wu says. “Personal genomes may become as common as a routine doctor’s visit.”
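A rough way to appreciate the difference between those two regimes is to apply the same halving formula, cost(t) = cost0 × 0.5^(t / T_half), with an 18-month half-life versus a roughly 3-month one. The starting cost and time span in the sketch below are placeholders chosen only to show the shape of the curves, not historical prices.

```python
def cost_after(cost0: float, months: float, halving_months: float) -> float:
    """Cost after `months` if it halves every `halving_months` (compound decay)."""
    return cost0 * 0.5 ** (months / halving_months)

START_COST = 10_000_000.0   # placeholder starting cost per genome, in dollars
MONTHS = 48                 # four years, chosen only for illustration

moore_pace = cost_after(START_COST, MONTHS, halving_months=18)  # Moore's Law-like pace
second_gen = cost_after(START_COST, MONTHS, halving_months=3)   # post-2008 pace

print(f"After {MONTHS} months at an 18-month halving: ${moore_pace:,.0f}")
print(f"After {MONTHS} months at a 3-month halving:   ${second_gen:,.0f}")
```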
Yet even as the price of sequencing falls five times more rapidly than Moore’s Law would predict, the work of assigning relevance to genes, their downstream products, and health status remains vastly underserved. This downstream genomic bioinformatics, Remedy Informatics’ specialty, is equally critical to making genomics an everyday tool with the potential to dramatically lower health care costs.
“Gene sequencing is no longer the weakest link. Rather, it’s our ability to aggregate and harmonize genomic data with the clinical world,” Wu says.
Angelo DePalma is a freelance writer living in Newton, NJ. You can reach him at angelo@adepalma.com.
For additional resources on Gene Sequencing, including useful articles and a list of manufacturers, visit www.labmanager.com/sequencing