Introduction
The falling cost of high-throughput sequencing (HTS) technologies has made them accessible to small labs, promoting a large number of genome-sequencing projects, even in non-model organisms. Nevertheless, genome assembly and annotation, especially of eukaryotic genomes, still represent major limitations (Dominguez Del Angel et al., 2018). The unique genomic characteristics of many non-model organisms, which often lack pre-existing gene models (Yandell & Ence, 2012), and the absence of closely related species with well-annotated genomes make the annotation process a major challenge. State-of-the-art pipelines for de novo genome annotation, such as BRAKER1 (Hoff, Lange, Lomsadze, Borodovsky, & Stanke, 2016) or MAKER2 (Holt & Yandell, 2011), allow the integration of multiple sources of evidence, such as RNA-seq, EST data, gene models from previously annotated species, or ab initio gene predictions (using software such as GeneMark (Lomsadze, Burns, & Borodovsky, 2014), Exonerate (Slater & Birney, 2005), GenomeThreader (Gremme, Brendel, Sparks, & Kurtz, 2005), Augustus (Stanke & Waack, 2003; Stanke, Diekhans, Baertsch, & Haussler, 2008), or SNAP (Korf, 2004)). Some of these pipelines, such as BRAKER1, report only those gene models supported by evidence. However, the gene models predicted by these automatic tools are often inaccurate, particularly for gene family members. Furthermore, these predictions can be especially inaccurate for medium- or low-quality assemblies, a common situation among the increasingly large number of draft genomes of non-model organisms used in molecular ecology studies. The correct annotation of gene families frequently requires additional programs, such as Augustus-PPX (Keller, Kollmar, Stanke, & Waack, 2011a), or semi-automatic and even manual approaches that evaluate the quality of the supporting data.
This latter task is usually performed in genome annotation editors, such as Apollo, which allow researchers to work simultaneously on the same annotation project (Lee et al., 2013).
A number of issues affect the quality of gene family annotations, especially for old or fast-evolving families (Yohe et al., 2019). First, new duplicates within a family usually originate by unequal crossing-over and are found in tandem arrays in the genome, with the most recent duplicates also being the physically closest (Clifton et al., 2017; Vieira, Sánchez-Gracia, & Rozas, 2007). This configuration often causes local misassemblies that result in the incorrect or failed identification of tandemly duplicated copies (i.e., it produces artifactual, incomplete, or chimeric genes along a genomic region). Second, the identification and characterization of gene copies in medium- to large-sized families tends to be laborious, requiring data from multiple sources, including well-annotated remote homologs and hidden Markov model (HMM) profiles. The fine and robust identification and annotation of the complete repertoire of a gene family in a typical draft genome is therefore a challenging task that requires substantial additional effort and is very tedious to perform manually.
To facilitate this curation task, we have developed BITACORA, a bioinformatics pipeline to assist the comprehensive annotation of gene families in genome assemblies. BITACORA requires a structurally annotated genome (in GFF and FASTA formats) or a draft assembly, and a curated database with well-annotated members of the focal gene families. The program performs comprehensive BLAST and HMMER searches (Altschul, 1997; Eddy, 2011) to identify putative candidate gene regions (whether already annotated or not), combines the evidence from all searches, and generates new gene models. The output of the pipeline consists of a new structural annotation (GFF) file along with the encoded sequences. These output sequences can be used directly to conduct downstream functional or evolutionary analyses, or to facilitate a fine re-annotation in genome browsers such as Apollo (Lee et al., 2013).
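To give a concrete sense of what combining evidence from several searches can involve, the following Python sketch merges overlapping hit coordinates from different search tools into candidate regions. It is only an illustration under simplifying assumptions and does not reproduce BITACORA's actual implementation: the `Hit` record, the `merge_hits` function, the `max_gap` parameter, and the example coordinates are all hypothetical, and strand and score information is ignored.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """A search hit on a genomic sequence (1-based, inclusive coordinates)."""
    seqid: str
    start: int
    end: int
    source: str  # illustrative label, e.g. "blast" or "hmmer"


def merge_hits(hits, max_gap=0):
    """Merge hits on the same sequence into candidate regions.

    A hit joins the current region when it starts no more than `max_gap`
    positions after that region's end (max_gap=0 means the hits must overlap).
    Returns (seqid, start, end, sources) tuples sorted by sequence and position.
    """
    by_seq = defaultdict(list)
    for hit in hits:
        by_seq[hit.seqid].append(hit)

    regions = []
    for seqid in sorted(by_seq):
        ordered = sorted(by_seq[seqid], key=lambda h: (h.start, h.end))
        first = ordered[0]
        cur_start, cur_end, cur_sources = first.start, first.end, {first.source}
        for hit in ordered[1:]:
            if hit.start <= cur_end + max_gap:
                # Extend the current region and record the supporting tool.
                cur_end = max(cur_end, hit.end)
                cur_sources.add(hit.source)
            else:
                # Close the current region and start a new one.
                regions.append((seqid, cur_start, cur_end, sorted(cur_sources)))
                cur_start, cur_end, cur_sources = hit.start, hit.end, {hit.source}
        regions.append((seqid, cur_start, cur_end, sorted(cur_sources)))
    return regions


# Hypothetical hits, invented purely for illustration.
example_hits = [
    Hit("scaffold_1", 100, 250, "blast"),
    Hit("scaffold_1", 200, 400, "hmmer"),
    Hit("scaffold_1", 900, 1000, "blast"),
    Hit("scaffold_2", 5, 50, "hmmer"),
]

for region in merge_hits(example_hits):
    print(region)
```

In this toy input, the two overlapping hits on scaffold_1 collapse into a single region supported by both tools, while the isolated hits remain separate. A real pipeline would additionally need to handle strand, reading frame, and hit quality before proposing a gene model.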