Their group has developed Atlas2 and SNPTools, two software packages for variation analysis of personal genomics and population sequencing data. He talks about how some of the current data analysis challenges are being overcome and what can be expected to change in the near future.
Q: What types of sequencing projects is your group currently working on?
A: Our group is involved in the discovery of candidate genes and genetic variants and in population genetics studies, and we also develop informatics tools for data analysis. Over the years, we have developed a number of well-known informatics tools for genetic variant discovery, such as Atlas2 and SNPTools. Nearly two years ago, we were one of the few groups in the world to start using the Amazon Cloud for variant calling in the 1000 Genomes Project. Recently, we developed a novel end-to-end variant discovery pipeline, goSNAP, which is highly scalable for use in hybrid computing environments and designed for large-cohort sequencing data analysis, with sample sizes in the thousands and beyond. The goSNAP pipeline can be deployed and optimized in different computing infrastructures, such as cloud services, national superclusters, or even local commodity clusters, so that the overall performance is not limited by the bottlenecks of any specific computing system. It allows for genetic variant discovery, genotyping, and phasing of several thousand samples in a one-month timeframe, thereby greatly reducing the data processing turnaround time without compromising data quality. We have played important roles in large-cohort and multiple-population sequencing projects. One of them is the 1000 Genomes Project, an international consortium aiming to build the most detailed reference of human genetic variation. In this project, our group contributed to the discovery of single-nucleotide polymorphisms (SNPs) across 2,500 samples (from 26 populations on five continents), as well as the genotype-likelihood analysis of the discovered variants. We also participated in the CHARGE project (Cohorts for Heart and Aging Research in Genomic Epidemiology), which has so far sequenced 15,000 samples by whole-exome sequencing and 5,000 samples by whole-genome sequencing.
All samples in the CHARGE project have very detailed phenotypes tracked for decades, and when combined with the genotype data, they are ideal for the discovery of novel candidate genes for common disease studies. We developed goSNAP to achieve, for the first time, high-quality variant calling and imputation of 5,000 whole-genome sequencing samples in six weeks. We have also gathered extremely low-coverage (1X) sequencing data from 200 South Asian samples, belonging to eight subgroups of the Indian population. The Indian population has a very complex demographic history, and its haplotypes were not well studied until recently. Hence, it is scientifically very intriguing to analyze the sequencing data of this population. We also used this study as a proof of concept that our variant discovery tools make it possible to do population genetics with extremely low-coverage data.
Finally, our group also does clinical analysis for personal genomics using high-coverage whole-exome sequencing data, for instance, of factor VIII in the blood clotting disorder hemophilia, and of other diseases. In collaboration with other research groups at Baylor, we also do comparative genomics using the high-throughput sequencing data of primate species such as rhesus monkeys, which are genetically very close to humans. By analyzing their genetic variants, we are able to identify exciting moments in the demographic history of rhesus monkeys and study how their population evolved over thousands and even millions of years.
Q: What are some of the trends you are seeing in next-gen sequencing?
A: Recently, the cost of sequencing has fallen greatly because of revolutionary advances in sequencing technology. It’s now possible to do whole-genome sequencing for less than $1,000 per genome, and the largest sequencing centers nowadays have the capacity to sequence tens to hundreds of thousands of individuals each year. Many startup companies are collecting DNA samples from patients to pursue diagnostic and treatment options. A lot of information technology (IT) companies are focusing on developing databases to analyze and manipulate the vast amounts of data generated. As sequencing gets faster and cheaper, personal genomics will become more realistic and will benefit many people, from individuals to large populations. Studies that are large-scale, both in sample size and in their use of whole-genome rather than targeted sequencing, can help us understand the population demographic history and the evolution of disease and mutations within the population. Many consortia are collecting and pooling together samples from large cohorts to do deep studies, which will provide an opportunity to study genetic variants and genes that are extremely rare and specific to certain populations.
Q: Can you elaborate on some of the challenges you are encountering?
A: Such high-throughput sequencing methods provide many opportunities, but they also increase the challenges. On the computational side, the large amount of data means we have to reduce the turnaround time for data processing and analysis. Today we can do whole-exome sequencing in a few hours and whole-genome sequencing in a day for one individual. But once you have thousands or millions of samples, it will surely take a significant amount of time and computational resources, especially if one needs to maintain or even improve the quality of results. Thus, reducing the computational cost of sequence alignment, variant discovery, annotation, and similar steps for large-scale sequencing projects is not a trivial effort. On the other hand, when you are aggregating large amounts of heterogeneous sequencing or variation data, it becomes increasingly challenging to consolidate and analyze them. In the next few years, people will be focusing on developing databases that facilitate data interpretation and prioritization. Data generation and collection are only one aspect of the game; translating the data into knowledge that can be applied in a clinical setting and for personalized medicine will be even more crucial. We will need more improvements in informatics and database technology to get this done.
Another challenge is how to efficiently update and query the databases. In the high-throughput sequencing era, the database has to be scalable, so that it can be seamlessly updated almost in real time, and performant, so that millions of researchers and doctors can query the up-to-date database with no latency. If there is a lot of heterogeneity in the data, integrating multiple databases also becomes a challenge. The footprint of the database itself, and of the data ingested into and extracted from it, is also critical. Instead of archiving everything, you have to retain only the data essential to downstream analysis. All of these are closely tied to the data format and database structure, which are key issues when it comes to big data analysis.
Q: Any advice for lab managers working in this field?
A: As a manager, you have to think outside of the box, have a broader vision, and stay up to date with the revolutionary developments in the technology. It’s very helpful to collaborate with different groups, become familiar with public databases, and if possible, get involved in large-cohort projects. In particular, when working with high-throughput sequencing, you have to stay on top of the changes taking place on both the sequencing and the informatics side. While the sequencing cost will continue to drop, the cost of informatics will grow significantly if traditional computational tools are used in a non-scalable way or if new computing technology and resources are not effectively exploited. It might not have been emphasized as much in the past, but nowadays, knowing some aspects of informatics (such as databases, scalability, and high-performance computing) will certainly benefit lab managers in their projects involving next-generation sequencing. At the end of the day, this all relates back to the time and cost of getting things done.
Dr. Zhuoyi Huang is a Postdoctoral Associate at the Baylor College of Medicine Human Genome Sequencing Center. He has expertise in large-scale genomic computing involving next-generation sequencing data, cloud computing, and high-performance computing. He is an important contributor to whole-genome sequencing data processing and analysis in the 1000 Genomes Project and the CHARGE project. Before joining the sequencing center, he received his PhD in Astrophysics and Astronomy in Naples, Italy, with the Marie Curie FP6 Early Stage Researcher Fellowship, followed by a two-year postdoctoral fellowship in Rome working on space telescope design for the European Space Agency and image processing for the European Southern Observatory.