BITACORA application example
To demonstrate the performance of BITACORA in annotating gene family members across genomes of differing assembly quality, we present an extended report of the results in Vizueta et al. (2018). Specifically, we selected two arthropod chemosensory gene families, the insect gustatory receptors (GR) and the Niemann-Pick type C2 (NPC2) proteins (Pelosi, Iovinella, Felicioli, & Dani, 2014; Robertson, 2015), in a subset of seven of the eleven chelicerate genomes surveyed in this study (Table 1; Fig. 2). We selected these gene families because they differ widely in number of members and protein length. Whereas the GR family is large and encodes seven-transmembrane receptors about 400 amino acids long, the NPC2 family has few members and encodes shorter proteins (about 150 amino acids on average); despite the difference in length, both gene families have a similar average number of exons per gene in the surveyed species. Furthermore, to validate the accuracy of our software on a gold-standard annotated genome, we also checked the performance of BITACORA in identifying these members in the genome of Drosophila melanogaster.
For the analysis, we retrieved genome sequences, annotations and predicted peptides of D. melanogaster (r6.31, FlyBase; Adams et al., 2000), the scorpions Centruroides sculpturatus (bark scorpion, genome assembly version v1.0, annotation version v0.5.3; Human Genome Sequencing Center (HGSC)) and Mesobuthus martensii (v1.0, Scientific Data Sharing Platform Bioinformation (SDSPB)) (Cao et al., 2013); and of the spiders Acanthoscurria geniculata (tarantula, v1, NCBI Assembly, BGI) (Sanggaard et al., 2014), Stegodyphus mimosarum (African social velvet spider, v1, NCBI Assembly, BGI) (Sanggaard et al., 2014), Latrodectus hesperus (western black widow, v1.0, HGSC), Parasteatoda tepidariorum (common house spider, v1.0 Augustus 3, SpiderWeb and HGSC) (Schwager et al., 2017) and Loxosceles reclusa (brown recluse, v1.0, HGSC).
In addition, for benchmarking purposes, we compared the performance of BITACORA with that of Augustus-PPX, a method that also uses protein profiles to improve the automatic annotation of gene family members (--proteinprofile; Keller et al., 2011; Stanke, Schöffmann, Morgenstern, & Waack, 2006), in annotating GR and NPC2 copies in the same seven chelicerate genomes. Strikingly, BITACORA identified thousands of gene models previously undetected in chelicerates, even after applying Augustus-PPX (Table 1; see also the supplementary data in Vizueta et al., 2018, for the BITACORA curated sequences). For instance, in the bark scorpion Centruroides sculpturatus, the automatic annotation pipelines report 24 GR-encoding sequences, whereas BITACORA identified and annotated 1,234 genes or gene fragments, compared with only 307 recovered with Augustus-PPX (Table 1; Supplementary table S1). Globally, BITACORA identified, annotated and curated 3,570 sequences encoding GR proteins across the seven chelicerate genomes (3,466 of which were absent from the available GFF files for these species), while Augustus-PPX predicted only 1,638 gene models for this family (Table 1; Supplementary table S1). It is well known that this gene family evolves rapidly in arthropods, in terms of both sequence change and repertoire size, so that a single genome encodes very recent and distantly related receptors as well as pseudogenes. Since some of these receptors show a very restricted gene expression pattern (expressed in specialized cells and tissues involved in chemoreception), their transcripts are often missing from RNA-seq data sets, which are one of the sources of evidence used for the automatic annotation of genomes (Joseph & Carlson, 2015; Robertson, 2015; Vizueta et al., 2017; Zhang, Zheng, Li, & Fan, 2014).
This fact, together with the huge divergence that many copies exhibit (old duplication events and/or rapid evolution), is probably the cause of the low accuracy of both the automatic annotations and Augustus-PPX.
The members of the NPC2 family, in contrast, are much more conserved at the sequence level and show higher levels of gene expression in arthropods (Pelosi et al., 2014). As expected, the number of newly identified copies is much lower than in the case of GRs. Even so, BITACORA was able to detect 44 novel NPC2-encoding sequences, raising the total annotated repertoire in these species from 75 to 119 (Table 1). In this case, Augustus-PPX recovered 97 gene models for this gene family, which improves on the previous automatic annotations but is still outperformed by BITACORA. Importantly, Augustus-PPX predicted thousands of gene models that are not real members of the focal gene family (Supplementary table S1), requiring further steps to separate true gene family copies from false assignments.
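The totals quoted above can be compared with a few lines of arithmetic. The following Python sketch is illustrative only and is not part of BITACORA: the dictionary keys and the `summarize` helper are hypothetical names, the counts are copied from the text, and sequences "absent from the available GFF" are assumed to be equivalent to novel sequences.

```python
# Totals across the seven chelicerate genomes, as quoted in the text.
# "novel" = sequences absent from the available GFF annotations (assumption:
# treated here as equivalent to newly identified copies).
reported = {
    "GR": {"bitacora": 3570, "novel": 3466, "augustus_ppx": 1638},
    "NPC2": {"bitacora": 119, "novel": 44, "augustus_ppx": 97},
}


def summarize(counts):
    """Derive simple comparison figures from the reported totals."""
    return {
        # Copies already present in the pre-existing annotations.
        "previously_annotated": counts["bitacora"] - counts["novel"],
        # Share of the final BITACORA repertoire that was newly identified.
        "novel_fraction": counts["novel"] / counts["bitacora"],
        # Ratio of BITACORA sequences to Augustus-PPX gene models.
        "bitacora_vs_ppx": counts["bitacora"] / counts["augustus_ppx"],
    }


summary = {family: summarize(counts) for family, counts in reported.items()}

for family, s in summary.items():
    print(
        f"{family}: {s['previously_annotated']} previously annotated, "
        f"{s['novel_fraction']:.1%} novel, "
        f"{s['bitacora_vs_ppx']:.2f}x Augustus-PPX"
    )
```

Under these assumptions, the fraction of novel sequences is far larger for GR than for NPC2, which matches the contrast drawn in the text between a rapidly evolving family and a conserved one.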
Finally, both methods correctly annotated all members of the GR and NPC2 families in the D. melanogaster genome, demonstrating the real utility of these tools in the genome drafts of non-model organisms. It is worth noting, however, that a non-negligible number of the newly identified genes in chelicerate genomes are incomplete (about 40% and 63% of the GR and NPC2 members, respectively). This feature can be partially explained by poor genome assembly quality (indicated by the N50 and the number of scaffolds) or by the low number of annotated proteins in the input GFF. Although BITACORA can be useful with such low-quality data, these limitations compromise its performance in terms of complete gene models.