Introduction
The falling cost of high-throughput sequencing (HTS) technologies has made
them accessible to small labs, promoting a large number of
genome-sequencing projects even in non-model organisms. Nevertheless,
genome assembly and annotation, especially in eukaryotic genomes, still
represent major limitations (Dominguez Del Angel et al., 2018). The
unique genomic characteristics of many non-model organisms, which often
lack pre-existing gene models (Yandell & Ence, 2012), and the
absence of closely related species with well-annotated genomes make
the annotation process a major challenge. State-of-the-art pipelines
for de novo genome annotation, like BRAKER1 (Hoff, Lange,
Lomsadze, Borodovsky, & Stanke, 2016) or MAKER2 (Holt & Yandell,
2011), can integrate multiple lines of evidence, such as RNA-seq, EST data,
gene models from previously annotated species, or ab initio gene
predictions (using software such as GeneMark (Lomsadze, Burns, &
Borodovsky, 2014), Exonerate (Slater & Birney, 2005), GenomeThreader
(Gremme, Brendel, Sparks, & Kurtz, 2005), Augustus (Stanke & Waack,
2003; Stanke, Diekhans, Baertsch, & Haussler, 2008) or SNAP
(Korf, 2004)). Some of these pipelines, such as BRAKER1, report only
those gene models supported by evidence. However, the gene models predicted by
these automatic tools are often inaccurate, particularly for gene family
members. Furthermore, these predictions can be especially inaccurate for
medium- or low-quality assemblies, a common situation among
the growing number of draft genomes of non-model organisms used
in molecular ecology studies. The correct annotation of gene families
frequently requires additional programs, such as Augustus-PPX (Keller,
Kollmar, Stanke, & Waack, 2011a), or semi-automatic or even manual
approaches that evaluate the quality of the supporting data. This latter
task is usually performed in genome annotation editors, such as Apollo,
which allow researchers to work simultaneously on the same
annotation project (Lee et al., 2013).
Several issues affect the quality of gene family
annotations, especially for old or fast-evolving families (Yohe
et al., 2019). First, new duplicates within a family usually originate
by unequal crossing-over and are found in tandem arrays in the genome,
with the most recent duplicates also being the physically closest (Clifton et
al., 2017; Vieira, Sánchez-Gracia, & Rozas, 2007). This configuration
often causes local misassemblies that result in the incorrect or
failed identification of tandemly duplicated copies (i.e., it produces
artifactual, incomplete, or chimeric genes along a genomic region).
Second, the identification and characterization of gene copies in
medium- to large-sized families tend to be laborious, requiring data
from multiple sources, including well-annotated remote homologs and
hidden Markov model (HMM) profiles. The accurate and robust
identification and annotation of the complete repertoire of a gene family
in a typical genome draft is therefore a challenging task that requires
substantial additional effort and is very tedious to perform manually.
To facilitate this curation task, we have developed BITACORA, a
bioinformatics pipeline to assist the comprehensive annotation of gene
families in genome assemblies. BITACORA requires a structurally
annotated genome (GFF and FASTA formats) or a draft assembly, and a
curated database with well-annotated members of the focal gene families.
The program performs comprehensive BLAST and HMMER searches
(Altschul, 1997; Eddy, 2011) to identify putative candidate gene regions
(whether already annotated or not), combines the evidence from all searches, and
generates new gene models. The output of the pipeline is a new
structural annotation (GFF) file along with the encoded sequences.
These output sequences can be directly used to conduct downstream
functional or evolutionary analyses or to facilitate a fine
re-annotation in genome browsers such as Apollo (Lee et al., 2013).
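As an illustration of what combining evidence from several searches can involve, the following minimal Python sketch merges overlapping hit coordinates from different search tools into candidate regions. It is not BITACORA's actual implementation: the `Hit` record, the `merge_hits` function, and the `max_gap` parameter are hypothetical names introduced here only for the example, and real hits would first have to be parsed from BLAST and HMMER output.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """A single search hit on a scaffold (hypothetical record for this sketch)."""
    scaffold: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    source: str  # label of the search that produced the hit, e.g. "BLAST"


def merge_hits(hits, max_gap=0):
    """Merge hits that overlap, or lie within max_gap bases of each other,
    into candidate regions, remembering which searches support each region."""
    by_scaffold = defaultdict(list)
    for hit in hits:
        by_scaffold[hit.scaffold].append(hit)

    regions = []
    for scaffold in sorted(by_scaffold):
        current = None  # [scaffold, start, end, set of sources]
        ordered = sorted(by_scaffold[scaffold], key=lambda h: (h.start, h.end))
        for hit in ordered:
            if current is not None and hit.start <= current[2] + max_gap:
                # The hit touches the open region: extend it and record the source.
                current[2] = max(current[2], hit.end)
                current[3].add(hit.source)
            else:
                # Close the previous region (if any) and open a new one.
                if current is not None:
                    regions.append((current[0], current[1], current[2], sorted(current[3])))
                current = [scaffold, hit.start, hit.end, {hit.source}]
        if current is not None:
            regions.append((current[0], current[1], current[2], sorted(current[3])))
    return regions
```

In this sketch, a region supported by more than one search simply carries several source labels; how such support is weighted when building gene models is a design decision that the sketch leaves open.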