Long ago, when I was interviewing as a prospective student at PhD programs across the country, a very stern and important professor at an august institution asked me to name the most significant development in molecular biology since the Watson-Crick structure of the DNA double helix. After I gave the most intelligent and elaborate version of “I don’t know” I could muster, he stared me down and proclaimed, “DNA sequencing,” as if that were the only possible answer. In the biomedical sciences, where lab meeting—let alone peer review—often constitutes blood sport, perhaps at least some argument was warranted; on second thought, perhaps he was right.
The technology’s acknowledged pioneer, Frederick Sanger, is one of only four scientists to win two Nobel prizes, having first elucidated the amino acid sequence of insulin in the 1950s. He established the foundational idea that proteins have unique structures and properties that can be predicted, providing powerful early proof of the Central Dogma of molecular biology. At the time, there were no reliable means to examine stretches of nucleic acid sequence, and so to a large extent the Central Dogma was confirmed in reverse, with discoveries of key protein translation structures and mechanisms preceding the report of the first polynucleotide sequences in the mid-1960s. Maxam and Gilbert later devised a method of nucleotide sequencing using chemical cleavage followed by electrophoretic separation of DNA bases. Sanger improved upon this by employing primer extension and chain termination, a method that gained primacy with its decreased reliance on toxic and radioactive reagents. Applied Biosystems (Foster City, CA) provided the first automated Sanger sequencers, later incorporating innovations such as computerized data analysis and the polymerase chain reaction. Industrial-scale automation resulted in deposition of thousands of expressed sequence tags on GenBank and other databases. This provided an impetus for gene cloning and functional studies through the 1990s, and populated a lattice to be filled in by the Human Genome Project and consortia to generate complete reference sequences for important animal models.
At the time, the publication of the draft sequence of the human genome might have seemed like an act of culmination; however, it was a mere prelude to the flowering of a new age of discovery, largely predicated on deep genotypic investigation of context-dependent phenotypes. The resulting pressure on the sequencing data pipeline quickly led to significant technological changes that far surpassed the Sanger method in cost and efficiency by flattening the workflow. These technologies are collectively known as next-generation sequencing (NGS); they rapidly identify and record nucleotide incorporation into complementary strands of amplified template DNA, in massively parallel synthesis reactions with a daily throughput in the hundreds of gigabases. As a point of comparison, a human genome that cost $100 million to sequence by the Sanger method in 2001 could be sequenced for roughly $10,000 by 2014 using NGS, with next-day turnaround. A short list of applications that have evolved since NGS’s inception includes the ability to:
- rapidly sequence whole genomes (genomics);
- quantify global gene expression patterns (transcriptomics);
- zoom in on regions of interest to deeply sequence target regions, including promoters/enhancers and micro- and non-coding RNAs (targeted and exome sequencing);
- perform epigenetic analyses to uncover patterns in DNA methylation and in histone post-translational modifications;
- achieve single-cell sequencing to define and identify rare cell types important in processes such as stem and progenitor cell differentiation, and the etiology of different cancers.
Although all NGS platforms run on the principle of massively parallel sequencing reactions, the modes of nucleotide incorporation and signal detection in the synthesis reactions differ among commercially available systems. The most widely used are the Illumina and Roche 454 platforms; functionally, the Roche system produces longer reads at lower throughput, but at greater cost and with a tendency to miscall homopolymer runs. The expense and expertise required to operate next-generation sequencers have largely relegated them to dedicated core facilities; when purchasing reagents and preparing samples, investigators must consider which systems and models are available. Because the lower limit of detection can be in the range of picograms, sample preparation must be optimized to allow the deepest coverage and the highest signal-to-noise ratio possible to ensure data validity. Regardless of the nature of the starting material—genomic DNA, mRNA, DNA-protein complexes, etc.—the precondition for generating useful NGS data sets is a broad library of nucleic acids. As in so much of molecular biology, there is always a kit for that.
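To make the notion of coverage depth concrete, the expected mean depth of a run can be estimated with the Lander-Waterman relation C = LN/G (read length times read count, divided by genome size). The sketch below is purely illustrative; the run parameters are hypothetical and not tied to any particular platform’s specifications:

```python
def mean_coverage(read_length_bp: int, num_reads: int, genome_size_bp: int) -> float:
    """Lander-Waterman expected mean coverage: C = L * N / G."""
    return read_length_bp * num_reads / genome_size_bp

# Hypothetical run: 150 bp paired-end reads, 400 million read pairs,
# mapped against a ~3.1 Gb human genome.
depth = mean_coverage(150, 2 * 400_000_000, 3_100_000_000)
print(f"Expected mean depth: {depth:.1f}x")
```

In practice, duplicate reads, uneven GC coverage, and unmappable regions pull the realized depth below this back-of-envelope figure, which is one reason deep nominal coverage is budgeted for applications such as variant calling.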
A generic workflow for library preparation is as follows: 1) sample collection and fragmentation via enzymatic digestion or shear forces; 2) end repair and phosphorylation of 5’ ends; 3) 3’ A-tailing to allow ligation of T-overhang adapters; and 4) a high-fidelity PCR-based amplification step to generate a product with adapters at both ends, barcoded for identification of individual samples run as multiplexed reactions in a single lane. Taking Illumina (San Diego, CA) products as an example, each library prep kit is engineered to appropriately modify and amplify the given starting material, while reducing the number of steps to avoid sample contamination. For RNA-seq studies, the TruSeq RNA kit executes a workflow of mRNA isolation, followed by fragmentation and first- and second-strand cDNA synthesis. Methyl sequencing depends on the addition of a modification: bisulfite treatment of genomic DNA converts cytosine residues into uracil while leaving 5-methylcytosine unmodified. The TruSeq Methyl Capture EPIC library prep kit adds probe hybridization and bisulfite conversion steps in addition to fragmentation, adapter ligation, and amplification.
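The logic of bisulfite conversion described above can be sketched in a few lines: unmethylated cytosines deaminate to uracil (read as thymine after PCR amplification), while 5-methylcytosines are protected and remain cytosine. The function below is a toy model for illustration only; analyzing real bisulfite data additionally requires strand-aware alignment and conversion-efficiency controls:

```python
def bisulfite_convert(seq: str, methylated_positions=frozenset()) -> str:
    """Simulate bisulfite treatment of one DNA strand.

    Unmethylated C -> U, which reads as T after PCR; a C at a position
    listed in methylated_positions (i.e., 5mC) is protected and kept as C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )

# The methylated C at index 2 survives; the unmethylated C at index 5 reads as T.
converted = bisulfite_convert("ATCGACG", methylated_positions={2})
print(converted)  # ATCGATG
```

Downstream, methylation calls are made by comparing converted reads back to the reference: a C that still reads as C implies methylation, while a C-to-T mismatch implies the unmethylated state.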
Finally, use of NGS technologies has become widespread, with genomic, transcriptomic, or epigenomic data sets almost a required feature of high-impact papers. However, these data sets provide a generalized snapshot that may not accurately represent what is biologically relevant in a tissue or in a sample of mixed cell types. Single-cell sequencing is a promising development for resolving this discrepancy, but it is most vulnerable to artifacts introduced during sample and library preparation. Specialized providers such as Fluidigm (South San Francisco, CA) and 10X Genomics (Pleasanton, CA) are innovating microfluidic apparatuses to handle comprehensive sample collection and library amplification protocols.
For additional resources on next-generation sequencing, including useful articles and a list of manufacturers, visit www.labmanager.com/NGS