2.2.4 vAMPirus DataCheck pipeline and report
The vAMPirus DataCheck pipeline can help investigators determine the
optimal parameters for read processing, ASV generation, and other
downstream analyses conducted in the Analyze pipeline. The DataCheck
pipeline is particularly beneficial for investigators working on nascent
virus systems because it facilitates the informed establishment of
gene-, lineage-, or system-specific analysis standards. The pipeline
produces an HTML report that displays information such as sequencing
success per sample, read characteristics (e.g., read length, GC
content), and ASV/aminotype sequence properties. The DataCheck pipeline
also provides insight into the ASV sequences by clustering them across a
range of nucleotide and amino acid similarities and plotting the
resultant number of cASVs per similarity value. Briefly,
nucleotide-based de novo cASVs (ncASVs) are produced by clustering ASV
sequences using 24 different percent identity values (55%, 65%, 75%, and
every integer value from 80% to 100%) with VSEARCH. To generate de novo
pcASVs, ASVs are
first translated using the program VirtualRibosome (v2.0, Wernersson,
2006), then clustered into de novo pcASVs using the same 24
percent identities with the program CD-HIT (v4.8.1, Fu et al., 2012; Li
& Godzik, 2006). For each percent identity value, the number of ncASVs
and pcASVs is quantified and visualized as a scatter plot in the
DataCheck report. Plotting cluster counts against percent identity is a
common approach for choosing a clustering percentage (e.g., Gustavsen &
Suttle, 2021): the percent
similarity at which there is no longer a linear rise in the number of
cASVs (the inflection point) is selected for sequence clustering.
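
As an illustration only, the sketch below lists the 24 percent identity
values and applies one possible heuristic for locating the point where
cluster counts stop rising linearly: the point farthest from the
straight line joining the first and last points of the rescaled curve.
The heuristic, the function names, and the cluster counts are
assumptions made for this example and are not part of vAMPirus; the
DataCheck report itself presents these counts as a scatter plot.

```python
def identity_values():
    """Return the 24 percent identity values: 55, 65, 75, and 80 through 100."""
    return [55, 65, 75] + list(range(80, 101))


def departure_from_linear(xs, ys):
    """Return the x value farthest from the chord joining the curve's end points.

    Both axes are rescaled using the first and last points so that the
    result does not depend on the magnitude of the cluster counts.
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three paired points")
    x_span = xs[-1] - xs[0]
    y_span = ys[-1] - ys[0]
    if x_span == 0 or y_span == 0:
        raise ValueError("curve is flat; no departure point can be located")
    best_x, best_gap = xs[0], -1.0
    for x, y in zip(xs, ys):
        xn = (x - xs[0]) / x_span
        yn = (y - ys[0]) / y_span
        # After rescaling, the chord is the line yn = xn, so the distance
        # to it is proportional to |xn - yn|.
        gap = abs(xn - yn)
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x


# Hypothetical cluster counts (not real data): a slow, steady rise up to
# 94% identity followed by a sharp increase.
ids = identity_values()
counts = list(range(10, 28)) + [30, 40, 60, 90, 140, 220]
print(departure_from_linear(ids, counts))
```

A heuristic of this kind is best treated as a starting point to be
checked against the plotted curve rather than as a replacement for
inspecting it.
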
Optionally, users can also apply the program oligotyping (Eren et al.,
2015) to calculate Shannon entropy values per sequence position for both
ASVs and aminotypes, which are then displayed in the report. An example
vAMPirus DataCheck report is available at
github.com/Aveglia/vAMPirusExamples.
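
The per-position Shannon entropy calculation described above can be
illustrated with a short sketch. It assumes the sequences are already
aligned to equal length, uses log base 2, and treats every character
(including gaps) as a symbol. These choices and the function names are
assumptions for this example and may differ from what the oligotyping
program does internally; the sequences are made up.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def positional_entropy(aligned):
    """Return one entropy value per position of an equal-length alignment."""
    if not aligned:
        raise ValueError("no sequences given")
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*aligned) walks the alignment column by column.
    return [column_entropy(col) for col in zip(*aligned)]


# Hypothetical aligned sequences (not real data).
example = ["ACGT", "ACGA", "ACTA", "AGTA"]
for position, value in enumerate(positional_entropy(example), start=1):
    print(position, round(value, 3))
```

Positions with entropy near zero are conserved across the sequences,
whereas higher values mark variable positions.
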