2.2.2 Amplicon sequence variants and ‘aminotypes’
By default, vAMPirus generates nucleotide-based (ASV) and protein-based (aminotype) results. ASVs support cross-study comparisons and offer a statistically supported view of virus sequence diversity, as biologically inaccurate sequences are removed during denoising (Callahan et al., 2017; Edgar, 2016b). However, ASV results for virus lineages with high mutation rates (e.g., RNA viruses with quasispecies heterogeneity) may still contain high levels of noise that mask biological patterns. In such cases, it may be beneficial to group ASVs into distinct clusters based on genetic or ecological similarities. In vAMPirus, ‘aminotypes’ (unique amino acid sequences; Grupstra et al., 2022) are generated by translating ASVs with VirtualRibosome (v2.0, Wernersson, 2006) and subsequently dereplicating these translations using the program CD-HIT (v4.8.1, Fu et al., 2012; Li & Godzik, 2006). As direct products of specific ASVs, aminotypes maintain sequence tractability, reproducibility, and comparability, and therefore differ from de novo OTUs or cASVs (see Section 2.2.3). The ‘aminotyping’ approach not only reduces noise; it also removes sequences with internal stop codons (deleterious mutations) and reveals nonsynonymous mutations that may indicate differences in virus functionality (e.g., infection efficiency, host range; DeFilippis & Villarreal, 2000).
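The logic of aminotyping (translate each ASV, discard translations with internal stop codons, then collapse identical translations) can be illustrated with a minimal Python sketch. This is not the vAMPirus implementation and does not call VirtualRibosome or CD-HIT; it assumes the standard genetic code, a single reading frame starting at the first base, and exact-match dereplication. The function names `translate` and `aminotypes` and the toy sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (assumption:
# the amplicon uses the standard code; other codes would need another table).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt, frame=0):
    """Translate a nucleotide sequence in one frame; unknown codons become 'X'."""
    nt = nt.upper().replace("U", "T")
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def aminotypes(asvs):
    """Map each unique translation (aminotype) to the ASV IDs that encode it."""
    groups = {}
    for asv_id, seq in asvs.items():
        protein = translate(seq)
        if "*" in protein[:-1]:  # stop codon before the final codon: discard
            continue
        groups.setdefault(protein, []).append(asv_id)
    return groups


# Toy ASVs (hypothetical): ASV1 and ASV2 differ by a synonymous change,
# ASV3 carries a nonsynonymous change, ASV4 has an internal stop codon.
asvs = {
    "ASV1": "ATGGCTAAA",
    "ASV2": "ATGGCCAAA",
    "ASV3": "ATGGATAAA",
    "ASV4": "ATGTAAAAA",
}
result = aminotypes(asvs)
print(result)
```

Because every aminotype keeps the list of ASVs that produced it, each protein-level result can be traced back to specific nucleotide sequences, which is the property that separates aminotypes from de novo clusters.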
vAMPirus provides two additional (optional) ASV or aminotype ‘grouping’ approaches that are alternatives to de novo clustering: Minimum Entropy Decomposition (MED) and phylogeny-based clustering or ‘phylogrouping’. MED is a method of sequence clustering that utilizes Shannon entropy (Shannon, 1948) to partition marker gene datasets into ‘MED nodes’ (Eren et al., 2015). With this approach, users identify sequence positions in a set of ASVs or aminotypes that are information-rich (positions of high variability) or information-poor (positions of high conservation) and use these positions to assign ASVs/aminotypes to ‘MED groups’ (sequences with identical bases or residues at the specified positions) (Eren et al., 2015). Users can also specify sequence positions of interest and assign sequences to MED groups based on them (e.g., positions of a protein sequence known to influence a viral characteristic such as host cell attachment; see Harvey et al., 2021). Phylogrouping is performed with the TreeCluster program (v1.0.3, Balaban et al., 2019). With this approach, ASV or aminotype sequences are assigned to ‘phylogroups’ based on user-specified TreeCluster parameters and the phylogenetic tree produced during analysis (see Figure 4-V, VI). All grouping methods can be applied in the same run. Coupled with the Nextflow ‘-resume’ feature, this makes it straightforward to adjust specific parameters and generate new results for review and comparison without re-running the entire DataCheck or Analyze pipelines.
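The entropy-based grouping described above can be sketched in a few lines of Python. This is a simplified, single-pass illustration, not the iterative decomposition of Eren et al. (2015) and not the code used by vAMPirus: it computes the Shannon entropy of each alignment column, selects columns above a cutoff, and groups sequences that share the same characters at those columns. It assumes pre-aligned sequences of equal length; the function names, the 0.5-bit cutoff, and the toy alignment are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def column_entropy(column):
    """Shannon entropy (bits) of the characters in one alignment column."""
    total = len(column)
    return -sum((n / total) * log2(n / total) for n in Counter(column).values())


def entropy_profile(aligned):
    """Per-position entropy for a dict of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned.values())]


def group_by_positions(aligned, positions):
    """Group sequence IDs by the characters they carry at the given positions."""
    groups = defaultdict(list)
    for seq_id, seq in aligned.items():
        key = "".join(seq[p] for p in positions)
        groups[key].append(seq_id)
    return dict(groups)


# Toy alignment (hypothetical): positions 1 and 4 vary, the rest are conserved.
aligned = {
    "ASV1": "ACGTA",
    "ASV2": "ACGTA",
    "ASV3": "ATGTC",
    "ASV4": "ATGTC",
}
profile = entropy_profile(aligned)

# Information-rich positions: entropy above an illustrative cutoff of 0.5 bits.
informative = [i for i, h in enumerate(profile) if h > 0.5]
med_groups = group_by_positions(aligned, informative)
print(informative, med_groups)
```

Passing a hand-picked list of positions to `group_by_positions` instead of `informative` corresponds to the second use case in the text, where groups are defined by positions of known biological interest.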
2.2.3 Optional de novo sequence clustering
vAMPirus provides the option to perform de novo clustering of ASVs into ‘clustered ASVs’ or ‘cASVs’ based on pairwise nucleotide (ncASV) and/or protein (pcASV) sequence similarity using the programs VSEARCH (Rognes et al., 2016) and CD-HIT (Fu et al., 2012; Li & Godzik, 2006), respectively. cASVs differ from traditional de novo OTUs in that sequences are denoised before they are clustered. De novo clustering of ASVs is most useful for better-characterized virus systems, where the degree of sequence divergence between taxonomically or ecologically distinct groups is known. Note that, from a methodological standpoint, representative sequences generated by a cASV approach exhibit the same issues as de novo OTUs (e.g., dataset dependence; see Callahan et al., 2017).
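The dataset dependence noted above can be made concrete with a small Python sketch of greedy, centroid-based clustering at a fixed identity threshold. This is an illustration only and does not reproduce VSEARCH or CD-HIT: it assumes equal-length, pre-aligned sequences, defines identity as the fraction of matching positions, and processes sequences in input order as a stand-in for the sorting those tools apply. The function names, the 0.9 threshold, and the toy sequences are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first centroid it matches, else make it one."""
    centroids = []  # (centroid ID, centroid sequence), in order of creation
    clusters = {}   # centroid ID -> member IDs
    for seq_id, seq in seqs.items():
        for cid, cseq in centroids:
            if identity(seq, cseq) >= threshold:
                clusters[cid].append(seq_id)
                break
        else:  # no centroid was similar enough: start a new cluster
            centroids.append((seq_id, seq))
            clusters[seq_id] = [seq_id]
    return clusters


# Toy ASVs (hypothetical): ASV1-ASV2 and ASV2-ASV3 are 90% identical,
# while ASV1-ASV3 are only 80% identical.
seqs = {
    "ASV1": "ACGTACGTAC",
    "ASV2": "ACGTACGTAA",
    "ASV3": "ACGTACGTTA",
}
full = greedy_cluster(seqs, 0.9)

# Removing ASV1 from the dataset changes which cluster ASV3 belongs to.
subset = greedy_cluster({k: v for k, v in seqs.items() if k != "ASV1"}, 0.9)
print(full, subset)
```

In the full dataset ASV3 forms its own cluster, whereas in the reduced dataset it is absorbed into a cluster centered on ASV2. The cluster membership and representative sequence therefore depend on which other sequences are present, which is the limitation shared by cASVs and de novo OTUs.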