Core database construction
AsgeneDB was built with an improved pipeline based on previous research
(Tu et al., 2019; Yu et al., 2021). First, the core database was
manually constructed from current knowledge and the literature on As
metabolism (S.-C. Chen et al., 2020; H.-T. Wang et al., 2019; C.
Zhang et al., 2021; Zhu et al., 2017).
As metabolic genes in KEGG were also referenced (Kanehisa et al., 2016).
Target sequences were downloaded from the Swiss-Prot and TrEMBL databases
(The UniProt Consortium, 2017) using keywords (gene and protein names)
that were created and refined for each gene family involved in As
metabolic pathways. To ensure the accuracy of AsgeneDB, the seed sequences
of each gene family were checked manually based on their annotations and
similarity to other sequences, especially for sequences with no
reference sequence in Swiss-Prot. Each gene family was then searched
against itself with usearch (version 11.0, 30% global identity cutoff)
to generate a pairwise distance matrix between its sequences.
A nearest-neighbor clustering procedure was then used to cluster the
sequences into groups.
Outlier sequences were rechecked against their annotation information in
Swiss-Prot and TrEMBL, and abnormal sequences were removed. The remaining
sequences were retained as the core database for As metabolic gene
families (Figure 1a).
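
The distance-matrix and clustering steps described above can be illustrated with a minimal Python sketch. It is not the pipeline actually used to build AsgeneDB. It assumes that the pairwise search results are available as (query, target, percent identity) tuples, that distance is defined as 1 - identity/100, that "nearest neighbor clustering" corresponds to single-linkage clustering, and that singleton clusters are the outliers to recheck. The function names, the `max_distance` cutoff, and the example sequence identifiers are hypothetical, since the text does not report them.

```python
from collections import defaultdict


def build_distance_matrix(hits, ids, min_identity=30.0):
    """Build a symmetric distance matrix from pairwise identity hits.

    Assumption: distance = 1 - identity/100. Pairs with no hit at or
    above min_identity keep the maximum distance of 1.0.
    """
    index = {seq_id: i for i, seq_id in enumerate(ids)}
    n = len(ids)
    dist = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
    for query, target, identity in hits:
        if query == target or identity < min_identity:
            continue
        i, j = index[query], index[target]
        d = 1.0 - identity / 100.0
        # Keep the smaller distance if a pair is reported in both directions.
        if d < dist[i][j]:
            dist[i][j] = dist[j][i] = d
    return dist


def nearest_neighbor_clusters(dist, ids, max_distance):
    """Single-linkage clustering: a sequence joins a group if its distance
    to any member is <= max_distance (hypothetical cutoff)."""
    parent = list(range(len(ids)))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            if dist[i][j] <= max_distance:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = defaultdict(list)
    for i, seq_id in enumerate(ids):
        groups[find(i)].append(seq_id)
    # Largest groups first.
    return sorted(groups.values(), key=len, reverse=True)


def flag_outliers(clusters, min_size=2):
    """Return sequences in clusters smaller than min_size for manual review."""
    return [seq_id for cluster in clusters if len(cluster) < min_size
            for seq_id in cluster]


if __name__ == "__main__":
    # Made-up identifiers and identities, for illustration only.
    ids = ["arsC_1", "arsC_2", "arsC_3", "odd_1"]
    hits = [("arsC_1", "arsC_2", 85.0),
            ("arsC_2", "arsC_3", 62.0),
            ("arsC_3", "odd_1", 31.0)]
    dist = build_distance_matrix(hits, ids)
    clusters = nearest_neighbor_clusters(dist, ids, max_distance=0.5)
    print(clusters)
    print(flag_outliers(clusters))
```

In this sketch the flagged sequences are only candidates for the manual annotation check; whether a sequence is abnormal is still decided against its Swiss-Prot or TrEMBL record.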