Comparing AsgeneDB against established orthology databases
To demonstrate the need for a manually curated As metabolism gene database, we compared the coverage of As metabolism gene subfamilies (Figure 2) in AsgeneDB with that of the main public orthology databases. Of the 59 gene subfamilies recruited to AsgeneDB, fewer than a third were found in any other single database; the largest proportion was found in KEGG (16 gene subfamilies), followed by COG (13 gene subfamilies), eggNOG (10 gene subfamilies), arCOG (6 gene subfamilies), and KOG (2 gene subfamilies). AsgeneDB also contains several key As metabolism gene families that are missing from these orthology databases, including As(V) respiratory reductase (arrA and arrB), organoarsenical efflux permeases (arsJ and arsP), As(V) reductase (GstB) and As(III) oxidase (aioR, arxR, arxA and arxB). Beyond containing more genes, AsgeneDB separates gene families that the public orthology databases merge into a single orthologous group. For example, arsB and acr3 are both involved in arsenite efflux but belong to two different phylogenetic clades (Achour et al., 2007; Cai et al., 2009; Rosen, 2002), yet in the KEGG, COG and eggNOG databases arsB and acr3 are merged into one orthology group (Table S3). Similarly, arsA, ASNA1 and GET3 are homologous genes (Hemmingsson, Zhang, Still, & Naredi, 2009; Kurdi-Haidar et al., 1996) that are not clearly distinguished in COG, KEGG and KOG. AsgeneDB therefore offers clear advantages over existing resources in coverage, representativeness and accuracy for identifying gene families related to As metabolism.
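The coverage comparison can be checked with simple arithmetic. The following is a minimal sketch that uses only the subfamily counts reported in this section; the names `TOTAL_SUBFAMILIES`, `covered` and `coverage_fractions` are illustrative and not part of AsgeneDB or its tooling.

```python
# Subfamily counts as reported in the text; the variable and function
# names below are hypothetical and chosen only for this illustration.
TOTAL_SUBFAMILIES = 59
covered = {"KEGG": 16, "COG": 13, "eggNOG": 10, "arCOG": 6, "KOG": 2}


def coverage_fractions(counts, total):
    """Return the fraction of AsgeneDB subfamilies found in each database."""
    return {db: n / total for db, n in counts.items()}


fractions = coverage_fractions(covered, TOTAL_SUBFAMILIES)

# Print databases from highest to lowest coverage.
for db, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{db}: {covered[db]}/{TOTAL_SUBFAMILIES} = {frac:.1%}")

# The "fewer than a third" statement holds if every fraction is below 1/3.
all_below_third = all(frac < 1 / 3 for frac in fractions.values())
```

Under these counts, KEGG, the best-covered database, reaches about 27% of the AsgeneDB subfamilies, so every individual database falls below one third.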