Diverse microbial communities are found all over the planet, from ocean and soil ecosystems to our own respiratory and intestinal systems. Only a small proportion of microbes are disease-causing pathogens, whereas many are commensals that regulate important processes, such as nutrient and energy cycling in natural ecosystems, and metabolism, aging, and disease susceptibility in humans. These microbial effects are in turn influenced by viruses that infect and co-evolve with the microbes, and several technological advances are enhancing scientists’ ability to “see” and study these viruses.
Sequence-based approaches
It is estimated that our planet contains 10^31 viruses, most of which infect microbes as opposed to living things we can see such as plants, animals, and humans. However, only a few hundred species with medical relevance are well-studied, explains Dr. Matthew Sullivan, professor of microbiology and director of the Center of Microbiome Science at Ohio State University. “This is due to most microbes, and therefore their viruses, not being in culture,” he says.
A very small percentage of bacterial species on Earth are culturable; however, virtually all species can be identified based on the 16S ribosomal RNA (16S rRNA) sequence contained within their DNA. “Notably, the 16S gene can be used to amplify that gene marker from uncultured cells from the environment, and is a way to ‘see’ virtually all of the bacteria in a complex system,” says Sullivan. Unlike bacteria, there is no single gene shared among all virus species. The inability to culture viruses, combined with the lack of a universal “barcode” gene, makes them especially challenging to study.
“To better understand viral roles in complex communities, sequence-based approaches have emerged since viral sequence space is so much larger than what we can culture,” explains Sullivan. “Further, metagenomic sequencing techniques, including shotgun sequencing or community genomics, are powerful for identifying and studying new viruses and their roles in the ecosystem,” he says. “This is because viruses do not share a single gene, so there is no ‘16S’ or ‘barcode gene’ to use, which means we have to sequence as much of the genomes as possible.”
According to Sullivan, “the process is becoming more systematic, and largely involves using either in silico tools to identify and then study virus sequences in existing bulk metagenomic datasets, or preparing virus-enriched or virus-purified particles (for example, through filtration and/or density gradients), and then applying metagenomic sequencing.”
Initially, researchers were limited in their ability to quantitatively evaluate the impacts of viruses on microbial ecosystem functions. Other than for a few cultivated viruses, and double-stranded DNA (dsDNA) genomes that were captured in quantitative viral metagenomes (viromes), the ecology of non-dsDNA viruses including single-stranded DNA (ssDNA) viruses was largely unknown.
Using mock viral communities that included both ssDNA and dsDNA viruses, Sullivan and his colleagues evaluated the capability of a sequencing library preparation approach for quantitative amplification of both ssDNA and dsDNA templates. The work confirmed that existing library preparation methods were biased for ssDNA templates. In addition, when the new approach was applied to viral DNA from freshwater and marine samples, it provided a first estimate of ssDNA abundance in these systems.
Another challenge arises when metagenomics is applied to samples that yield only a few nanograms of DNA for library construction. These types of samples necessitate a DNA amplification step before or after adapter ligation to create a sequencing library; however, many of these approaches have been shown to produce strong amplification biases. Sullivan and his colleagues identified a strategy that combines read deduplication and a specific assembly algorithm to optimize de novo genome assembly from PCR-amplified metagenomes.
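The read deduplication step mentioned above can be illustrated with a minimal Python sketch. It assumes the simplest possible definition of a duplicate, namely two reads with identical sequences, and keeps the first copy of each. The function name and the toy reads are hypothetical; the article does not name the tools or the exact criteria used in the study, and this sketch does not cover the assembly step.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str]  # (read_id, sequence); a hypothetical minimal read record


def deduplicate_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield the first read seen for each distinct sequence.

    Assumption: a duplicate is an exact, case-insensitive sequence match.
    """
    seen = set()
    for read_id, seq in reads:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            yield read_id, seq


# Toy input: r2 and r4 repeat the sequence of r1, as PCR duplicates might.
reads = [
    ("r1", "ACGTACGT"),
    ("r2", "ACGTACGT"),
    ("r3", "TTGACCAA"),
    ("r4", "acgtacgt"),
]
unique_reads = list(deduplicate_reads(reads))
```

Removing such copies before assembly keeps over-amplified fragments from dominating the data, which is the intuition behind pairing deduplication with a suitable assembler.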
Sullivan notes that the majority of viruses that result from sequencing studies are new, and that novel analytical approaches have been developed to address the unknown sequence space (for example, protein clustering and gene-sharing networks). Sullivan and his colleagues have been working on democratizing a “viral ecogenomic toolkit” through iVirus, a collection of software apps, database resources, and computing all wrapped up in the NSF CyVerse collaborative cyberinfrastructure.
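As a rough illustration of the gene-sharing network idea, the Python sketch below links two viral genomes when they share a large enough fraction of their protein clusters. Everything in it is an assumption made for illustration: the genome names, the cluster labels, the Jaccard index as the similarity score, and the 0.3 cut-off. Published tools typically rely on more rigorous statistics than a plain Jaccard threshold.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def gene_sharing_edges(
    genome_clusters: Dict[str, Set[str]], min_jaccard: float = 0.3
) -> List[Tuple[str, str, float]]:
    """Return (genome_a, genome_b, score) for pairs sharing enough protein clusters.

    The score is the Jaccard index of the two genomes' protein-cluster sets.
    """
    edges = []
    for a, b in combinations(sorted(genome_clusters), 2):
        union = genome_clusters[a] | genome_clusters[b]
        if not union:
            continue  # two genomes with no annotated clusters cannot be compared
        score = len(genome_clusters[a] & genome_clusters[b]) / len(union)
        if score >= min_jaccard:
            edges.append((a, b, score))
    return edges


# Hypothetical genomes described by the protein clusters (PCs) they encode.
genome_clusters = {
    "virus_A": {"PC1", "PC2", "PC3"},
    "virus_B": {"PC2", "PC3", "PC4"},
    "virus_C": {"PC9"},
}
edges = gene_sharing_edges(genome_clusters)
```

In this toy network only virus_A and virus_B are connected, because they share two of four clusters, while virus_C stays isolated. Groups of connected genomes can then be treated as candidate taxonomic units even when no reference genome exists.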
This toolkit also addresses some of the analytical challenges posed by newly generated metagenomes: the combination of large volumes of data, analytical tools that necessitate significant computational training, and the distribution of metagenomic datasets across a variety of repositories. The resource is designed to aid researchers in using viromics tools to study the ecological effects that viruses have on microbes.
Accurate inference requires standards
The increasing popularity of viromics techniques necessitates a set of bioinformatics standards for accurate analysis. Using in silico mock viral communities, Sullivan and his colleagues evaluated the sequence-to-ecological-inference pipeline, including read pre-processing and metagenome assembly, thresholds to estimate viral relative abundance based on read mapping to assembled contigs, as well as normalization methods. To create a mock community, the team randomly selected genomes of viruses infecting bacteria or archaea from the NCBI RefSeq database. Virome sequencing was then simulated for each mock community using the Next-generation Sequencing Simulator for Metagenomics (NeSSM). The simulator is designed to generate metagenome sequencing reads using parameters set by the user. Based on the simulations, they were able to determine benchmarks for selecting analysis cut-offs, to support accurate inferences from viral communities.
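To make the role of such cut-offs concrete, here is a minimal Python sketch of one way to turn read-mapping results into a normalized abundance per contig. It is a sketch under stated assumptions, not the benchmarked pipeline: the input layout, the function name, the 0.75 breadth-of-coverage cut-off, and the reads-per-kilobase-per-million normalization are all illustrative choices.

```python
from typing import Dict


def normalized_abundance(
    contigs: Dict[str, Dict[str, int]],
    total_reads: int,
    min_breadth: float = 0.75,
) -> Dict[str, float]:
    """Estimate per-contig abundance from read-mapping summaries.

    Each contig entry holds 'length', 'mapped_reads' and 'covered_bases'.
    Contigs whose covered fraction falls below min_breadth are treated as
    not detected (abundance 0.0). The rest are reported as mapped reads per
    kilobase of contig per million sequenced reads.
    """
    abundance = {}
    for name, stats in contigs.items():
        breadth = stats["covered_bases"] / stats["length"]
        if breadth < min_breadth:
            abundance[name] = 0.0
            continue
        per_kilobase = stats["mapped_reads"] / (stats["length"] / 1_000)
        abundance[name] = per_kilobase / (total_reads / 1_000_000)
    return abundance


# Hypothetical mapping summary for two assembled contigs.
contigs = {
    "contig_1": {"length": 40_000, "mapped_reads": 2_000, "covered_bases": 38_000},
    "contig_2": {"length": 10_000, "mapped_reads": 50, "covered_bases": 2_000},
}
abundance = normalized_abundance(contigs, total_reads=1_000_000)
```

Here contig_2 is set to zero because reads cover only a fifth of its length, which may reflect spurious mapping rather than a virus that is truly present. Normalizing by contig length and sequencing depth keeps long genomes and deeply sequenced samples from appearing artificially abundant.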
A single-cell approach
“Though under-recognized, virus infections in natural systems are often going to be inefficient as a natural virus will rarely meet its optimal host, with optimal physiology, in the wild,” explains Sullivan. To understand the mechanisms underlying sub-optimal virus-host interactions, researchers may turn to single-cell approaches.
According to Sullivan, single-cell approaches are valuable for determining whether the lack of a successful lytic infection resulted from lysogeny, a process by which the virus integrates into the host genome for passive replication, or from stalled lytic growth, by which the infecting phage takes control of host cell machinery. “However, multi-omics have been invaluable for understanding the many ways that infection efficiency is altered either on an alternate host or suboptimal growing conditions,” he adds.
Tools to support discovery-based science
As the vast majority of bacterial species, and therefore the viruses that infect them, are not amenable to traditional culture methods, Sullivan notes that traditional hypothesis-driven experimental science can be difficult to achieve. “That shows up time and again in reviews from researchers that might not work with complex communities,” he says.
A discovery-based approach may be an effective alternative, supported by more recent advances in technology and analytics. “I think we’re at a critical inflection point in science where the toolkit has changed so drastically that the discovery-based science empowered by scalable measurements, such as sequencing and analytics (machine learning, for example) is arguably as important as more traditional disciplinary experiments,” says Sullivan.
Indeed, machine learning methods have been applied to identify viral sequences in metagenomic data, including deep learning, an approach that relies on algorithms inspired by the structure of the human brain. DeepVirFinder, for example, is a reference-free, alignment-free machine learning method, and MARVEL (Metagenomic Analysis and Retrieval of Viral Elements) is a tool designed to predict dsDNA phage sequences in metagenomic bins. Sullivan and his colleagues have also recently released the updated VirSorter2 tool to leverage machine learning and capture dsDNA viruses, ssDNA viruses, RNA viruses, and giant viruses.
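What "alignment-free" can mean in practice is easiest to see with a small example. The Python sketch below converts a sequence into a vector of k-mer frequencies, a generic kind of feature that a classifier can learn from without comparing the sequence to any reference database. It is a general illustration only, and is not a description of how DeepVirFinder, MARVEL, or VirSorter2 encode their inputs; the function name and the toy sequence are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(seq: str, k: int = 3) -> List[float]:
    """Return the relative frequency of every possible DNA k-mer in seq.

    The vector follows the lexicographic order of k-mers over A, C, G, T.
    Windows containing other characters (such as N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


# A toy contig; with k=2 the vector has 4**2 = 16 entries.
features = kmer_frequencies("ACGTACGTAC", k=2)
```

Vectors like this, computed for known viral and non-viral sequences, could serve as training data for a classifier that then scores new contigs, including ones with no recognizable similarity to any cultured virus.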
Ultimately, Sullivan believes that in broadly appreciating both hypothesis-driven and discovery-based science, innovation will flourish. Going forward, he hopes to apply the viral ecogenomics toolkit he and his colleagues developed to study viruses in the wild and in their ecosystem context. “With that toolkit in hand, we are now excited to apply it to diverse ecosystems to answer questions ranging from improving human health outcomes to quantifying speciation rates in hundreds of thousand-year-old viruses, to understanding how viruses of microbes impact ocean, soil, and other ecosystems.”
Ohio State recently launched a new Center of Microbiome Science and an NSF-funded EMERGE Biology Integration Institute. “Together, these new collaborative efforts are helping push the boundaries of ‘genes to ecosystems’ science, to bring understanding to how microbiomes and their viruses impact the planet,” says Sullivan.
Certainly, with advanced sequencing-based approaches and powerful analytical tools, researchers can gain a deeper understanding of the multitude of viruses that affect our natural world and our physical selves.