Input data files
BITACORA requires: i) a data file with the genome sequences (in FASTA format); ii) the associated GFF file with annotated features (in either GFF3 or GTF format; features must include transcript or mRNA, as well as CDS); iii) a data file with the predicted proteins included in the GFF (in FASTA format); and iv) a database (here referred to as the FPDB) with the protein sequences of well-annotated members of the gene family of interest (focal family; in FASTA format) along with its HMM profile (see Supplementary Material for a detailed description of FPDB construction). Since sequence similarity-based searches are very sensitive to the quality of the proteins in the FPDB, it is important to include in this database highly curated proteins from closely related species. This is especially important for the annotation of very old or fast-evolving gene families. In addition, the use of an HMM profile increases the likelihood of identifying sequences encoding new members; these profiles can be obtained from external databases (such as Pfam) or built from high-quality protein alignments with the program hmmbuild (Finn et al., 2014). Before starting the analysis, BITACORA checks whether the input data files are correctly formatted; if they are not, it suggests format converters distributed with the program (see Troubleshooting section in Supplementary Material).
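The input requirements above can be illustrated with a minimal Python sketch of a pre-flight check. It is not BITACORA's own validation code; the function names (`fasta_ids`, `gff_summary`, `check_inputs`) are hypothetical, and the sketch assumes that each protein identifier in the FASTA file equals the `ID` (GFF3) or `transcript_id` (GTF) of an mRNA or transcript feature, which may not hold for every annotation.

```python
import re


def fasta_ids(lines):
    """Return the set of sequence identifiers (first token after '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff_summary(lines):
    """Collect feature types and transcript IDs from GFF3 or GTF lines."""
    feature_types = set()
    transcript_ids = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
        ftype, attrs = cols[2], cols[8]
        feature_types.add(ftype)
        if ftype in ("mRNA", "transcript"):
            # GFF3 style: ID=...;  GTF style: transcript_id "..."
            match = (re.search(r'(?:^|;)\s*ID=([^;]+)', attrs)
                     or re.search(r'transcript_id "([^"]+)"', attrs))
            if match:
                transcript_ids.add(match.group(1))
    return feature_types, transcript_ids


def check_inputs(gff_lines, protein_fasta_lines):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    feature_types, transcript_ids = gff_summary(gff_lines)
    if not ({"mRNA", "transcript"} & feature_types):
        problems.append("no mRNA or transcript features")
    if "CDS" not in feature_types:
        problems.append("no CDS features")
    # Assumption: protein IDs are identical to transcript IDs in the GFF.
    missing = fasta_ids(protein_fasta_lines) - transcript_ids
    if missing:
        problems.append(
            f"{len(missing)} protein(s) without a matching transcript in the GFF"
        )
    return problems
```

Both arguments are iterables of text lines, so open file handles can be passed directly, for example `check_inputs(open("annotation.gff3"), open("proteins.fasta"))`, where the file names are placeholders. An empty list means that none of the three checks failed; it does not guarantee that the files satisfy every requirement of the pipeline.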