Full database construction
After the core database was created, orthology databases including COG,
arCOG, KOG, eggNOG and KEGG were searched against the core database.
There were two purposes for comparing the databases. The first was to
increase the comprehensiveness of the core database. The second was to
identify homologous gene families and include them in the full database,
thereby reducing false positives in database searching (Tu et al.,
2019).
In addition, corresponding sequences (As metabolic gene families) from
the NCBI RefSeq database (Identical Protein Groups) of bacteria,
archaea, and eukarya were identified, extracted, and merged.
The coverage of As metabolizing functional species in AsgeneDB was
determined by comparing the full database against NCBI RefSeq (options:
-evalue 1e-6 -id 60).
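
As an illustration of how the two cutoffs quoted above act on search
results, the following is a minimal sketch rather than the pipeline
actually used. It assumes the hits are available as BLAST-style
12-column tabular rows, which the text does not state; the function
name and the example rows are hypothetical.

```python
import csv
import io

# Thresholds taken from the options quoted in the text.
MAX_EVALUE = 1e-6
MIN_IDENTITY = 60.0


def filter_hits(handle, max_evalue=MAX_EVALUE, min_identity=MIN_IDENTITY):
    """Yield (query, subject, identity, evalue) for hits passing both cutoffs.

    Assumes BLAST-style 12-column tabular rows (an assumption, not stated
    in the source): column 3 is percent identity, column 11 is the e-value.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 12:
            continue  # skip blank or malformed lines
        identity = float(row[2])
        evalue = float(row[10])
        if identity >= min_identity and evalue <= max_evalue:
            yield row[0], row[1], identity, evalue


# Hypothetical example rows, invented purely for illustration.
example = io.StringIO(
    "seqA\tref1\t98.5\t300\t4\t0\t1\t300\t1\t300\t1e-150\t550\n"
    "seqB\tref2\t45.0\t280\t150\t3\t1\t280\t5\t284\t1e-20\t90\n"
    "seqC\tref3\t75.0\t100\t25\t0\t1\t100\t1\t100\t1e-3\t40\n"
)
kept = list(filter_hits(example))
```

Only the first row passes both cutoffs: the second falls below 60%
identity and the third exceeds the e-value limit.
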
Complete taxonomic level information of sequences was determined using
TaxonKit (Shen & Xiong, 2019). Finally, the sequence IDs and genes were
matched with taxonomic information to generate the taxonomy file.
Sequences of both As metabolic gene families and homologous gene
families were clustered by cd-hit (Fu, Niu, Zhu, Wu, & Li, 2012) at
100% identity.
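
The effect of clustering at 100% identity can be approximated, for
illustration only, by collapsing exactly identical sequences. The sketch
below is a simplified stand-in and not the cd-hit algorithm: cd-hit
may also absorb shorter sequences that match part of a longer
representative, which is not attempted here. The function names and the
example records are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            fields = line[1:].split()
            # Use the first whitespace-delimited token as the identifier.
            header, chunks = (fields[0] if fields else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def collapse_identical(records):
    """Group records with exactly identical sequences (case-insensitive).

    Returns a dict mapping each representative ID (the first one seen)
    to the list of all member IDs, representative included.
    """
    by_sequence = {}
    for seq_id, seq in records:
        by_sequence.setdefault(seq.upper(), []).append(seq_id)
    return {ids[0]: ids for ids in by_sequence.values()}


# Hypothetical records, invented purely for illustration.
fasta = ">p1 example protein\nMKT\nAIL\n>p2\nMKTAIL\n>p3\nMKTAIV\n"
clusters = collapse_identical(parse_fasta(fasta))
```

Here `p1` and `p2` share one sequence and fall into a single cluster
represented by `p1`, while `p3` differs at the last residue and remains
on its own.
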
All representative sequences and related information were checked and
used to construct AsgeneDB (Figure 1b).
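
The matching of sequence IDs and genes with taxonomic information,
described above, amounts to a join across lookup tables. The sketch
below shows one possible way to assemble such a taxonomy table. The
column layout, the seven-rank semicolon-delimited lineage string, the
"NA" placeholder, and all identifiers are assumptions made for
illustration; the actual layout of the AsgeneDB taxonomy file is not
specified in the text.

```python
N_RANKS = 7  # assumed: kingdom through species


def build_taxonomy_rows(seq_to_gene, seq_to_taxid, taxid_to_lineage):
    """Join sequence ID, gene name, and lineage into one row per sequence.

    Sequences with no taxid, or with a taxid missing from the lineage
    lookup, receive "NA" for every rank.
    """
    rows = []
    for seq_id in sorted(seq_to_gene):
        taxid = seq_to_taxid.get(seq_id)
        lineage = taxid_to_lineage.get(taxid, "")
        ranks = lineage.split(";") if lineage else []
        # Pad short lineages so that every row has the same width.
        ranks = (ranks + ["NA"] * N_RANKS)[:N_RANKS]
        rows.append([seq_id, seq_to_gene[seq_id]] + ranks)
    return rows


# Hypothetical inputs, invented purely for illustration.
seq_to_gene = {"seq1": "arsC", "seq2": "aioA"}
seq_to_taxid = {"seq1": "t1"}  # seq2 deliberately has no taxid
taxid_to_lineage = {
    "t1": "Bacteria;PhylumX;ClassX;OrderX;FamilyX;GenusX;GenusX sp."
}

rows = build_taxonomy_rows(seq_to_gene, seq_to_taxid, taxid_to_lineage)
table = "\n".join("\t".join(row) for row in rows)
```

Padding unresolved ranks with a placeholder keeps every row the same
width, so the table can be read back without special handling of
missing lineages.
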