Input data files
BITACORA requires: i) a data file with the genome sequences (in FASTA
format); ii) the associated GFF file with annotated features (either in
GFF3 or GTF format; features must include both transcript (or mRNA) and
CDS records); iii) a data file with the predicted proteins included in the GFF
(in FASTA format); and iv) a database (hereafter referred to as FPDB)
with the protein sequences of well-annotated members of the gene family
of interest (focal family; in FASTA format) along with its HMM profile
(see Supplementary Material for a detailed description of FPDB
construction). Since sequence similarity-based searches are very
sensitive to the quality of the proteins in FPDB, it is important to
include in this database highly curated proteins from closely related
species. This is especially important for the annotation of very old or
fast-evolving gene families. In addition, the use of an HMM profile increases
the likelihood of identifying sequences encoding new members; these
profiles can be obtained from external databases (such as Pfam) or built
from high-quality protein alignments with the program hmmbuild (Finn et
al., 2014). Before starting the analysis, BITACORA checks whether the
input data files are correctly formatted; if they are not, it suggests
format converters distributed with the program (see
Troubleshooting section in Supplementary Material).
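
The input requirements above can be checked by hand before a run. The following Python sketch is an illustration only and is not the validation code distributed with BITACORA: the function names (fasta_ids, gff_feature_types, has_required_features) are hypothetical, and the checks assume a tab-separated, nine-column GFF3 or GTF file and FASTA headers whose first token is the sequence identifier.

```python
import io


def fasta_ids(handle):
    """Return the identifiers (first token after '>') found in a FASTA stream."""
    return [
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    ]


def gff_feature_types(handle):
    """Return the set of feature types (column 3) found in a GFF3/GTF stream."""
    types = set()
    for line in handle:
        # Skip blank lines and comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        types.add(cols[2])
    return types


def has_required_features(types):
    """True if the annotation has transcript or mRNA features, and CDS features."""
    return bool(types & {"mRNA", "transcript"}) and "CDS" in types


# Toy in-memory inputs (made-up records, for illustration only).
example_gff = (
    "##gff-version 3\n"
    "scaf1\tsrc\tgene\t1\t900\t.\t+\t.\tID=g1\n"
    "scaf1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=g1.t1;Parent=g1\n"
    "scaf1\tsrc\tCDS\t1\t900\t.\t+\t0\tID=cds1;Parent=g1.t1\n"
)
example_proteins = ">g1.t1 hypothetical protein\nMKTLLV\n"

features_ok = has_required_features(gff_feature_types(io.StringIO(example_gff)))
protein_ids = fasta_ids(io.StringIO(example_proteins))
```

For real data, the same functions can be applied to file handles returned by open(); a further check, not shown here, would be to confirm that every protein identifier also appears among the transcript identifiers in the GFF.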