2.3 vAMPirus Analysis Repository
To encourage and simplify the dissemination of parameters and non-read files needed to reproduce vAMPirus analyses, we created the ‘vAMPirus Analysis Repository’ (zenodo.org/communities/vampirusrepo/). The vAMPirus Analysis Repository is a Zenodo Community intended as a central location where investigators can deposit vAMPirus configuration files, metadata files, databases used for taxonomy assignment or ASV filtering, and any other files required to reproduce an analysis. Instructions and recommendations for submission are available in the vAMPirus manual (shorturl.at/uCO28). Once uploaded, submissions to the vAMPirus Analysis Repository are given a DOI.
Validating the vAMPirus workflow with published double-stranded DNA (dsDNA) virus datasets
We assessed the functionality and performance of vAMPirus’ analytical workflow using amplicon sequencing datasets from two previously published dsDNA virus studies (Table 1). The research questions associated with each study serve as examples in Figure 1A (Q1: Finke & Suttle 2019; Q2: Frantzen & Holo 2019). For each dataset, we ran a vAMPirus analysis that reproduced the analysis from the associated published paper as closely as possible. For example, if a study generated de novo OTUs based on 97% nucleotide identity, the vAMPirus equivalent was ncASVs generated at 97% nucleotide identity with similar data quality control constraints. We then compared the results of the vAMPirus-based analyses to the findings described in each source manuscript. In brief, vAMPirus identified the same biological patterns as those published by Finke and Suttle (2019; Figure 3) and Frantzen and Holo (2019; Figure 4) from their respective sequence datasets, and detected additional, previously unreported virus diversity (Table 1). For example, Finke and Suttle (2019) reported increased cyanophage community alpha diversity in samples collected from sites with higher salinity (>27.5 practical salinity units; Figure 3-I, II); this pattern was also present in the corresponding vAMPirus results (Figure 3-III, IV, V, VI), which included 86% more cyanophage pcASVs than the number of OTUs reported in Finke and Suttle (2019; Table 1). Similarly, the patterns of lactococcal phage OTU richness and relative abundances per sample reported by Frantzen and Holo (2019; Figure 4-I) were also present in the vAMPirus results (Table 2; Figure 4-II). vAMPirus reported 43% more lactococcal phage ncASVs than the number of OTUs reported by Frantzen and Holo (2019; Table 1, Figure 4).
In addition, vAMPirus ASV-level analysis (Figure 4-III) revealed high lactococcal phage nucleotide-level diversity (n=531), yet aminotyping results (Figure 4-IV) suggest that most of the nucleotide differences underlying this richness are synonymous: the ASV sequences translated to only 29 aminotypes. Aminotype phylogrouping (see Section 2.2.2) of these data with TreeCluster highlighted a previously hidden overlap of lactococcal phage diversity across samples and dairy plants (Figure 4-VI).
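The collapse of many ASVs into far fewer aminotypes described above can be illustrated with a minimal Python sketch. It is not the vAMPirus implementation: the function names (`translate`, `group_by_aminotype`) and the four toy sequences are hypothetical, the standard genetic code is assumed, and every sequence is assumed to start in frame. The sketch translates each nucleotide variant and groups variants that yield the same protein sequence, so synonymous differences disappear at the aminotype level.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate an in-frame nucleotide sequence; trailing partial codons are ignored."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


def group_by_aminotype(asvs):
    """Map each unique translated sequence (aminotype) to the ASV names that encode it."""
    groups = {}
    for name, seq in asvs.items():
        groups.setdefault(translate(seq), []).append(name)
    return groups


# Hypothetical toy ASVs: the first three differ only by synonymous substitutions.
TOY_ASVS = {
    "ASV1": "ATGGCTAAA",  # Met-Ala-Lys
    "ASV2": "ATGGCCAAA",  # GCT -> GCC, still Ala
    "ASV3": "ATGGCTAAG",  # AAA -> AAG, still Lys
    "ASV4": "ATGGTTAAA",  # GCT -> GTT, Ala -> Val (non-synonymous)
}

if __name__ == "__main__":
    for aminotype, members in group_by_aminotype(TOY_ASVS).items():
        print(aminotype, members)
```

In this toy case four ASVs reduce to two aminotypes, which mirrors in miniature how nucleotide-level richness can greatly exceed protein-level richness when most substitutions are synonymous.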
Some variation between the results obtained from vAMPirus and those in the previous publications was expected, as the pipelines used in these comparisons were not identical. The only striking difference between the original results (Finke and Suttle 2019; Frantzen and Holo 2019) and those produced by vAMPirus is the higher number of pcASVs and ncASVs, respectively, identified by vAMPirus. Taxonomy results generated within vAMPirus, in which DIAMOND blastx aligned sequences against the NCBI virus RefSeq database, confirmed that the pcASVs and ncASVs are of cyanophage and lactococcal phage origin, respectively (Supplemental Figures S4 and S5). The higher diversity identified by vAMPirus may be attributable to differences in the reference databases used (boutique versus NCBI-curated), the handling of singletons, and other factors.
Table 1. Breakdown of test datasets used during vAMPirus development, including the methods and results from the original (published) analysis, as well as results from vAMPirus analysis. vAMPirus results were generated using de novo clustering of ASVs into ‘clustered ASVs’ (cASVs) based on pairwise nucleotide (ncASV) and protein (pcASV) sequence similarity. dsDNA = double-stranded DNA.