Optional search round and final output
Finally, BITACORA can perform a second search round that uses all proteins obtained in steps 1 and 2 (the sFPDB database) as input. This additional step (step 3 in Fig 1) is especially useful for detecting remote homologs missed in the first round. The final BITACORA output includes: 1) an updated GFF file with both b-curated and b-novel gene models; 2) all non-redundant proteins predicted from these feature annotations (in a FASTA file); 3) two BED files, one with the coordinates of all independent TBLASTN hits found in the genome sequence and the other with only those hits that would encode novel putative exons; and 4) all protein sequences found in all steps.
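As an illustration of what a non-redundant protein set can mean in practice, the following minimal Python sketch collapses exact duplicate sequences in a FASTA file. It is not BITACORA's own procedure: the function names `read_fasta` and `drop_exact_duplicates` are hypothetical, and the redundancy criterion used here (identical sequence, ignoring case and a trailing stop symbol) is an assumption that may differ from the one the pipeline applies.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_exact_duplicates(records):
    """Keep the first record for each distinct sequence.

    Assumption: two proteins are redundant only if their sequences are
    identical after upper-casing and removing a trailing '*'.
    """
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper().rstrip("*")
        if key in seen:
            continue
        seen.add(key)
        unique.append((header, seq))
    return unique


if __name__ == "__main__":
    example = [">p1", "MKT", "LLA", ">p2", "mktlla*", ">p3", "MKTQ"]
    for header, seq in drop_exact_duplicates(read_fasta(example)):
        print(f">{header}\n{seq}")
```

A stricter or looser criterion (for example, collapsing sequences that are substrings of longer ones) would require a different key or a clustering tool.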
Additional features
BITACORA can also be used in the absence of either a reference genome for the target species (e.g. for transcriptomic studies; Protein mode) or a precompiled GFF (e.g. for non-annotated genomes; Genome mode); in these cases, the input should be a FASTA file with the set of predicted proteins or the genome sequences, respectively (see Supplementary Material for alternative usage modes). With BITACORA, we also distribute a series of scripts to perform some useful tasks, such as estimating intron length statistics from a GFF, converting GFF to GTF format, and retrieving all protein sequences encoded by the features of a GFF file. Furthermore, to better adjust to the particularities of each genome, BITACORA allows the user to specify the values of the most important parameters, such as the E-value for BLAST and HMMER searches, the number of threads in BLAST runs, and the algorithm used to build novel gene models from TBLASTN hits.
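To make the intron length task concrete, the sketch below shows one way such statistics could be estimated from a GFF. It is an independent illustration, not the script distributed with BITACORA: the function names `intron_lengths` and `summarize` are hypothetical, and it assumes GFF3-style input in which CDS features carry a `Parent` attribute, so that introns can be inferred as the gaps between consecutive CDS segments of the same transcript.

```python
from collections import defaultdict
from statistics import mean, median


def intron_lengths(gff_lines, feature="CDS"):
    """Infer intron lengths as gaps between consecutive features
    that share the same sequence ID and Parent attribute."""
    segments = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        # Parse the GFF3 attribute column (key=value pairs separated by ';').
        attrs = dict(
            field.strip().split("=", 1)
            for field in cols[8].split(";")
            if "=" in field
        )
        parents = attrs.get("Parent")
        if parents is None:
            continue
        # A feature may list several parents separated by commas.
        for parent in parents.split(","):
            segments[(cols[0], parent)].append((int(cols[3]), int(cols[4])))

    lengths = []
    for coords in segments.values():
        coords.sort()
        for (_, prev_end), (next_start, _) in zip(coords, coords[1:]):
            gap = next_start - prev_end - 1
            if gap > 0:  # skip adjacent or overlapping segments
                lengths.append(gap)
    return lengths


def summarize(lengths):
    """Return basic descriptive statistics, or None if no introns were found."""
    if not lengths:
        return None
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
    }


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "chr1\tsrc\tmRNA\t100\t500\t.\t+\t.\tID=t1",
        "chr1\tsrc\tCDS\t100\t200\t.\t+\t0\tID=c1;Parent=t1",
        "chr1\tsrc\tCDS\t301\t400\t.\t+\t0\tID=c2;Parent=t1",
        "chr1\tsrc\tCDS\t451\t500\t.\t+\t2\tID=c3;Parent=t1",
    ]
    print(summarize(intron_lengths(example)))
```

Statistics of this kind, such as the maximum or an upper percentile of intron length, could help in choosing how far apart two hits may lie and still be joined into one gene model, although the exact parameter they inform depends on the algorithm selected.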