Abstract
Amplicon sequencing is an effective and increasingly applied method for studying viral communities in the environment. Here, we present vAMPirus, a user-friendly, comprehensive, and versatile DNA and RNA virus amplicon sequence analysis program, designed to support investigators in exploring virus amplicon sequencing data and running informed, reproducible analyses. vAMPirus takes raw virus amplicon libraries as input and, by default, performs nucleotide- and protein-based analyses to produce results such as sequence abundance information, taxonomic classifications, phylogenies, and community diversity metrics. The vAMPirus pipelines additionally include optional approaches that can increase the biological signal-to-noise ratio in results by leveraging tools not yet commonly applied to virus amplicon data analyses. In this paper, we validate the vAMPirus analytical framework and illustrate its implementation within the general virus amplicon sequencing workflow by recapitulating findings from two previously published double-stranded DNA virus datasets. As a case study, we also apply the program to explore the diversity and distribution of a coral reef-associated RNA virus. vAMPirus is integrated with the Nextflow workflow manager, offering straightforward scalability, standardization, and communication of virus lineage-specific analyses. The vAMPirus framework itself is also designed to be adaptable; community-driven analytical standards will continue to be incorporated as the field advances. vAMPirus supports researchers in revealing patterns of virus diversity and population dynamics in nature, while promoting study reproducibility and comparability.
Introduction
From the human gut to sediments in the deep ocean, viruses are abundant, diverse, and shape the systems they inhabit (Breitbart et al., 2018; Correa et al., 2021; Suttle, 2007). The advent of high-throughput sequencing (HTS) techniques such as amplicon sequencing has transformed the field of virology, illuminating the currently unculturable virosphere (Labadie et al., 2020; Metcalf et al., 1995; Paez-Espino et al., 2017; Zayed et al., 2022) and helping identify the impacts of viruses on ecosystem and host function (Braga et al., 2020; Breitbart et al., 2018; Thurber et al., 2017; Uyaguari-Diaz et al., 2016). Amplicon sequencing is a targeted, polymerase chain reaction (PCR)-based HTS approach that allows deep characterization of genetic variants within virus populations (Short et al., 2010). The targeted nature of amplicon sequencing reduces the economic and computational investment required for spatiotemporal investigations of virus communities at ecologically relevant scales (see Finke & Suttle, 2019; Frantzen & Holo, 2019; Grupstra et al., 2022; Gustavsen & Suttle, 2021; Howe-Kerr et al., 2022; Montalvo-Proaño et al., 2017). The number of studies leveraging virus amplicon sequencing has increased rapidly over the last two decades (e.g., 16 peer-reviewed publications in 1998 compared to 127 in 2021 based on a Web of Science search of ‘virus amplicon sequencing’, November 2022).
The general virus amplicon sequencing workflow includes: (1) extraction of virus nucleic acid (DNA or RNA), (2) PCR amplification of a virus marker gene or transcript, (3) HTS of virus marker gene amplicons, and (4) bioinformatic analysis of sequencing data (Short et al., 2010; Figure 1). The effective analysis and interpretation of amplicon sequencing data rely on biologically accurate binning of marker gene sequences into taxonomically or ecologically distinct units. Recognizing viral taxa or ecotypes, however, can be challenging. For example, non-model viruses have limited baseline information available to inform the selection of clustering thresholds. Other viruses, such as RNA viruses, have error-prone polymerases and produce quasispecies, a population structure consisting of large numbers of variant genomes (Domingo & Perales, 2019) that may not be easily resolved by a single clustering percentage. Amplicon sequence variants (ASVs) are a promising non-clustering-based approach for virus amplicon analyses that offers high precision and biological accuracy, as error-derived sequence variants are removed during ASV generation (Callahan et al., 2017; Edgar, 2016b). In addition, the identity of an ASV is not specific to a given dataset, unlike that of de novo operational taxonomic units (OTUs), which are formed by clustering the marker gene sequences of a dataset at a chosen percent identity (Callahan et al., 2017). ASVs and their unique translations (‘aminotypes’, see Grupstra et al., 2022) can therefore be compared directly among studies (Callahan et al., 2017).
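The relationship between ASVs and aminotypes can be illustrated with a minimal Python sketch. It is not the vAMPirus implementation: the function names, the fixed reading frame 0, and the use of the standard genetic code are all assumptions made for illustration. The sketch translates each ASV and sums the read counts of ASVs that share an identical translation, so that synonymous nucleotide variants collapse into one aminotype.

```python
from collections import defaultdict

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(nucleotides):
    """Translate a nucleotide sequence in frame 0 (assumed for illustration).

    RNA input is accepted (U is read as T), a trailing partial codon is
    ignored, and codons containing ambiguous bases are translated as 'X'.
    """
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[pos:pos + 3], "X") for pos in range(0, usable, 3)
    )


def collapse_to_aminotypes(asv_counts):
    """Sum read counts of ASVs that share an identical translation."""
    aminotype_counts = defaultdict(int)
    for asv_sequence, count in asv_counts.items():
        aminotype_counts[translate(asv_sequence)] += count
    return dict(aminotype_counts)


# Hypothetical ASVs: the first two differ only by a synonymous change (GCT/GCC).
example_asvs = {"ATGGCTAAA": 10, "ATGGCCAAA": 5, "ATGGATAAA": 3}
example_aminotypes = collapse_to_aminotypes(example_asvs)
```

In this toy input, three ASVs yield two aminotypes, because the first two sequences encode the same peptide. A real analysis would also need to determine the correct reading frame and handle internal stop codons, which this sketch does not attempt.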
To promote the standardization, reproducibility, and cross-comparison of DNA and RNA virus amplicon sequence analyses, we developed the automated bioinformatics tool vAMPirus (github.com/Aveglia/vAMPirus). vAMPirus takes raw (unprocessed) virus amplicon libraries as input, performs all read processing and diversity analysis steps, and produces reports detailing results (e.g., relative abundance plots, community diversity metrics) with interactive figures and tables. vAMPirus supports initial explorations of viral amplicon sequence datasets via a ‘DataCheck’ pipeline, which generates an HTML report with information on data quality and sequence diversity. Results from the exploratory DataCheck pipeline can then be used to optimize parameters in the read processing or ASV generation steps within the vAMPirus ‘Analyze’ pipeline; this can improve the signal-to-noise ratio in downstream analyses. vAMPirus is integrated with the Nextflow workflow manager, which uses a configuration file that can be shared among investigators, facilitating the standardization and dissemination of virus amplicon sequence analyses across projects and research groups. To that end, we also created the vAMPirus Analysis Repository (https://zenodo.org/communities/vampirusrepo/) to act as a central location for all published vAMPirus analyses. vAMPirus is intended to be accessible to researchers with a range of bioinformatics experience levels, and includes substantial help documentation with step-by-step instructions for running the tool (https://github.com/Aveglia/vAMPirus/blob/master/docs/). By facilitating the standardization of viral lineage-specific analyses and increasing the signal-to-noise ratio in community diversity analyses, vAMPirus will enhance the effectiveness of virus amplicon studies and lead to a more developed understanding of the global virosphere.
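As a concrete illustration of the kinds of results mentioned above, the following minimal Python sketch computes relative abundances and one widely used community diversity metric, the Shannon index, from per-sample sequence counts. It is not taken from vAMPirus: the function names and the example counts are hypothetical, and the choice of the Shannon index (natural logarithm) is an assumption, since the specific metrics reported by the tool are not listed here.

```python
import math


def relative_abundance(counts):
    """Convert raw counts per sequence variant into proportions of the sample."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {variant: count / total for variant, count in counts.items()}


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over variants with p_i > 0."""
    proportions = relative_abundance(counts).values()
    return -sum(p * math.log(p) for p in proportions if p > 0)


# Hypothetical counts for one sample (variant label -> number of reads).
sample_counts = {"ASV1": 50, "ASV2": 30, "ASV3": 20}
sample_proportions = relative_abundance(sample_counts)
sample_shannon = shannon_index(sample_counts)
```

A sample dominated by a single variant gives an index near zero, while an even spread of reads across many variants gives a higher value, which is why such metrics are sensitive to spurious, error-derived sequences.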