Full database construction
After the core database was created, orthology databases including COG, arCOG, KOG, eggNOG and KEGG were searched against it. This comparison served two purposes: first, to increase the comprehensiveness of the core database; second, to identify homologous gene families and include them in the full database, thereby reducing false positives in database searches (Tu et al., 2019). In addition, the corresponding sequences of As metabolic gene families from the NCBI RefSeq database (Identical Protein Groups) for bacteria, archaea, and eukarya were identified, extracted, and merged. The coverage of As-metabolizing functional species in AsgeneDB was determined by comparing the full database against NCBI RefSeq (options: -evalue 1e-6 -id 60). Complete taxonomic-level information for the sequences was determined using TaxonKit (Shen & Xiong, 2019). Sequence IDs and genes were then matched with the taxonomic information to generate the taxonomy file. Sequences of both As metabolic gene families and homologous gene families were clustered with cd-hit (Fu, Niu, Zhu, Wu, & Li, 2012) at 100% identity. All representative sequences and their related information were checked and used to construct AsgeneDB (Figure 1b).
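The threshold filtering and ID matching described above can be illustrated with a minimal Python sketch. It is not the pipeline used to build AsgeneDB. It assumes that the search output is a 12-column BLAST-style tabular file (percent identity in column 3, e-value in column 11) and that both thresholds are inclusive; the function names, the gene map, and the lineage map are hypothetical and introduced only for illustration.

```python
def filter_hits(tabular_text, max_evalue=1e-6, min_identity=60.0):
    """Keep hits with e-value <= max_evalue and identity >= min_identity.

    Assumes a 12-column BLAST-style tabular layout (an assumption, the
    source does not state the output format):
    query, subject, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    kept = []
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        pident = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and pident >= min_identity:
            kept.append((fields[0], fields[1], pident, evalue))
    return kept


def build_taxonomy_rows(seq_ids, gene_of, lineage_of):
    """Join sequence IDs with gene names and lineages.

    gene_of and lineage_of are hypothetical dictionaries keyed by
    sequence ID. IDs missing from either map are dropped.
    """
    rows = []
    for seq_id in seq_ids:
        if seq_id in gene_of and seq_id in lineage_of:
            rows.append((seq_id, gene_of[seq_id], lineage_of[seq_id]))
    return rows


if __name__ == "__main__":
    # Illustrative rows only, not real search results.
    demo = (
        "q1\ts1\t75.0\t100\t0\t0\t1\t100\t1\t100\t1e-20\t200\n"
        "q2\ts2\t55.0\t100\t0\t0\t1\t100\t1\t100\t1e-30\t150\n"
    )
    hits = filter_hits(demo)
    subjects = [hit[1] for hit in hits]
    table = build_taxonomy_rows(
        subjects,
        {"s1": "arsC"},
        {"s1": "Bacteria;Proteobacteria"},
    )
    for row in table:
        print("\t".join(row))
```

In this sketch, rows that fail either threshold are discarded before the join, so the resulting taxonomy rows contain only sequences that pass both filters and have both a gene assignment and a lineage.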