Identifying new genomic regions encoding gene family members
In the second step, BITACORA uses TBLASTN to search the genome sequence for regions that encode homologs of the proteins included in the uFPDB but are not annotated in the uGFF. BITACORA offers two approaches for generating novel gene models from the TBLASTN results (selected with the “gemoma” parameter). On the one hand, BITACORA can run GeMoMa, a homology-based gene prediction program that uses amino acid sequence and intron position conservation to reconstruct gene models from BLAST hits (Keilwagen, Hartung, & Grau, 2019; Keilwagen, Hartung, Paulini, Twardziok, & Grau, 2018; Keilwagen et al., 2016). On the other hand, it can apply a “close proximity” strategy. Under this strategy, all independent TBLASTN hits (i.e., those remaining after merging all overlapping alignments in the TBLASTN results) that are located on the same scaffold and separated by less than a predetermined distance (set with the “intron distance” parameter) are connected to form a single gene model. This step is intended to join all coding exons of the same gene, based on the average intron length in the focal genome. We provide scripts to estimate this average length from the input GFF (see Supplementary Material).
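The close-proximity strategy can be illustrated with a minimal Python sketch. This is not BITACORA's own code: the function names, the (scaffold, start, end) hit representation, and the example coordinates are all hypothetical, and the sketch ignores strand and query identity for brevity. It first merges overlapping hits on each scaffold and then connects neighboring merged hits whose gap is smaller than the chosen intron distance.

```python
from collections import defaultdict


def merge_overlapping(hits):
    """Merge overlapping (start, end) intervals from a single scaffold."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def chain_hits(hits, max_intron):
    """Group merged hits separated by less than max_intron into candidate models."""
    models = []
    for start, end in merge_overlapping(hits):
        if models and start - models[-1][-1][1] < max_intron:
            # Close enough to the previous hit: treat as another exon of that model.
            models[-1].append((start, end))
        else:
            models.append([(start, end)])
    return models


def close_proximity(tblastn_hits, max_intron):
    """Build candidate gene models per scaffold from (scaffold, start, end) hits."""
    by_scaffold = defaultdict(list)
    for scaffold, start, end in tblastn_hits:
        # Normalize coordinates so that start <= end (hypothetical simplification).
        by_scaffold[scaffold].append((min(start, end), max(start, end)))
    return {s: chain_hits(h, max_intron) for s, h in by_scaffold.items()}


# Hypothetical hits; the 5000 bp threshold is an arbitrary illustrative value.
example_hits = [
    ("scf1", 100, 250),
    ("scf1", 200, 300),
    ("scf1", 900, 1000),
    ("scf1", 20000, 20150),
    ("scf2", 50, 120),
]
models = close_proximity(example_hits, max_intron=5000)
```

In this example the first two hits on scf1 overlap and are merged, the third lies 600 bp downstream and is joined to the same model, and the fourth is too distant and starts a new model.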
Finally, to avoid reporting inaccurate gene models caused by artifactual gene fusions in dense gene clusters or by any other error (regardless of which of the two approaches has been applied), BITACORA checks for the presence of the gene family-specific protein domain (using the HMM profile in FPDB) and reports in the curated dataset only those gene models that contain the domain. In addition, all proteins are tagged with a label that indicates the number of different domains in the sequence (Ndom). This final filtering step can be relaxed with the BITACORA “genomicblastp” option, which retains gene models with positive hits in either the HMMER search or a BLASTP search against the proteins in FPDB (see Supplementary Material for details).
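The difference between the default and the relaxed filter can be sketched as follows. This is only an illustration under stated assumptions, not BITACORA's implementation: the function name, the input dictionary of per-model domain counts, and the set of BLASTP-positive model identifiers are hypothetical and would in practice be derived from parsed HMMER and BLASTP output.

```python
def filter_gene_models(domain_counts, blastp_positive=(), relaxed=False):
    """Return {model_id: ndom} for the gene models that pass the filter.

    domain_counts: hypothetical dict mapping model id to the number of
        family-specific domains detected in the predicted protein.
    blastp_positive: hypothetical collection of model ids with a BLASTP hit.
    relaxed: if True, a BLASTP hit alone is enough to keep a model.
    """
    blastp_positive = set(blastp_positive)
    kept = {}
    for model_id, ndom in domain_counts.items():
        has_domain = ndom > 0
        has_blastp = relaxed and model_id in blastp_positive
        if has_domain or has_blastp:
            kept[model_id] = ndom
    return kept


# Hypothetical candidates: model_2 and model_4 lack a detectable domain.
candidates = {"model_1": 1, "model_2": 0, "model_3": 2, "model_4": 0}
strict = filter_gene_models(candidates)
relaxed = filter_gene_models(candidates, blastp_positive={"model_2"}, relaxed=True)
```

Here the strict filter keeps only model_1 and model_3, whereas the relaxed filter also keeps model_2 because it has a BLASTP hit; model_4 is discarded in both cases.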