Discussion
Gene families are one of the most abundant and dynamic components of
eukaryotic genomes. Therefore, having curated genomic data is
fundamental not only to carry out comprehensive comparative or
functional genomics studies on gene families, but also to understand
global genome architecture and biology. Over the last decades, the
rapid development of sequencing technologies has enabled the accumulation
of large numbers of genome sequences from non-model organisms. These
projects, which often address very specific molecular ecology questions or
are part of large comparative genomics analyses, typically rely on
automatic annotation pipelines, and very little effort is devoted to
curating these annotations (see Sánchez-Herrero et al., 2019, and
references therein). The proteins predicted by automatic annotation
tools often contain systematic errors, such as incomplete or chimeric
gene models, which are especially notable in gene families given the
repetitive nature of their members. Moreover, since new copies commonly
arise by unequal crossing-over, they are frequently found in physically
close tandem arrays of similar sequences, further complicating
annotations (Clifton et al., 2017; Vieira et al., 2007).
With this in mind, we have developed a bioinformatics tool that helps
researchers to access these automatic annotations, extract the
information of focal gene families, curate and update gene models and
identify new copies from DNA sequences. With BITACORA, gene family
annotations can be substantially improved using both HMM profiles and iterative
searches that incorporate the new variability found in previous
searches. Indeed, we validated our tool by comparing its performance
with a method developed to improve the annotation of gene family members
matching a protein profile, Augustus-PPX (Keller et al., 2011b;
Stanke et al., 2006). BITACORA not only outperforms the annotations of
Augustus-PPX in the two examples shown here, but also proved to
be more accurate in its predictions.
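The iterative strategy mentioned above can be illustrated with a minimal Python sketch. It is not BITACORA's actual code: the function name `iterative_search`, the `search` callback and the toy similarity table are hypothetical stand-ins for a real homology search, and the only idea shown is that sequences found in one round are added to the query set of the next round until nothing new appears.

```python
from typing import Callable, Iterable, Set


def iterative_search(
    seeds: Iterable[str],
    search: Callable[[Set[str]], Set[str]],
    max_rounds: int = 10,
) -> Set[str]:
    """Repeat a search, adding newly found members to the query set each
    round, until no new member appears or max_rounds is reached."""
    known = set(seeds)
    for _ in range(max_rounds):
        hits = search(known)
        new = hits - known
        if not new:
            break  # converged: the last round found nothing new
        known |= new
    return known


# Toy stand-in for a real homology search (hypothetical data):
# each sequence "detects" only the neighbours listed for it.
toy_similarity = {
    "seedA": {"copy1"},
    "copy1": {"copy2"},
    "copy2": set(),
    "unrelated": {"other"},
}


def toy_search(queries: Set[str]) -> Set[str]:
    found: Set[str] = set()
    for query in queries:
        found |= toy_similarity.get(query, set())
    return found


members = iterative_search({"seedA"}, toy_search)
```

In this toy case, `copy2` is only reachable through `copy1`, so a single search seeded with `seedA` alone would miss it, whereas the iterative loop recovers it in the second round.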
The estimation of gene gains and losses, and the associated birth and
death rates analyses, are very sensitive to the quality of genome
annotations. The example of the GR family in chelicerates demonstrates
the importance of refining annotations using BITACORA. Indeed, directly
using unsupervised annotations from low-quality genome drafts of non-model
organisms to estimate turnover rates might produce highly
erroneous results, not only in terms of gene counts but also because the
calculations would be biased towards highly expressed and/or very recent
copies. BITACORA can therefore be used to considerably reduce these errors
and to make more accurate and robust inferences about the age and origin
of the family and its mode of evolution.
On the other hand, the curation of both existing and newly identified
members of a family with BITACORA might also be crucial for further
analyses of their sequence evolution. The quality of multiple sequence
alignments, which are used to determine orthology groups, to obtain
divergence estimates or to detect the footprint of natural selection in
gene family members, is strongly compromised by the presence of badly
annotated copies, including chimeras and incorrectly annotated
fragments. Using BITACORA we can detect these artifacts and either fix
or discard them from further analyses.
Despite its proven utility, we are aware that BITACORA does not provide
perfect annotations for a gene family. The GeMoMa algorithm is
more sensitive than the close-proximity method and generates more accurate
gene models; however, in the presence of assembly errors or highly
fragmented genomes, this approach might fail to identify genes,
especially putative pseudogenes. In these situations, the close-proximity
method could help to detect such copies and report them in the final output.
Furthermore, to overcome putative gene model errors, BITACORA implements
some filtering steps to determine if the predicted coding sequences are
correct. The program carries out a HMMER search to identify the protein
family domain in all newly annotated sequences. In addition, if the HMMER
search is negative, BITACORA can relax this step by checking if the
novel genes show significant BLASTP hits in a search against FPDB
proteins. In this case, the sensitivity of the annotations will increase
at the expense of specificity (i.e. it could generate false allocations
to the focal family in the presence of repetitive regions or FPDB
contaminations, for instance). It is important to note that BITACORA
generates homology-based predictions that could require different levels
of experimental validation depending on the nature of further downstream
analyses.
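The filtering logic described above can be summarised in a short Python sketch. The names `Evidence`, `keep_gene_model` and `relaxed` are hypothetical and do not correspond to BITACORA's own options; the two boolean fields are assumed to be computed beforehand from the HMMER and BLASTP search results.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Search results for one newly annotated sequence (hypothetical structure)."""
    has_family_domain: bool    # HMMER search identified the protein family domain
    has_fpdb_blastp_hit: bool  # significant BLASTP hit against FPDB proteins


def keep_gene_model(evidence: Evidence, relaxed: bool = False) -> bool:
    """Keep a model if the family domain is found; in relaxed mode,
    a significant BLASTP hit against FPDB proteins is accepted as a fallback."""
    if evidence.has_family_domain:
        return True
    # Relaxed mode trades specificity for sensitivity.
    return relaxed and evidence.has_fpdb_blastp_hit


candidate = Evidence(has_family_domain=False, has_fpdb_blastp_hit=True)
strict_decision = keep_gene_model(candidate)
relaxed_decision = keep_gene_model(candidate, relaxed=True)
```

Here the same candidate is discarded under the strict criterion and retained under the relaxed one, which is the sensitivity versus specificity trade-off noted in the text.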
Notwithstanding such filtering steps, BITACORA offers an output directly
readable in genome editor tools, such as Apollo, which makes it easier for
researchers to improve gene models. Fig. 3 shows an example of the
annotation tracks generated by BITACORA (GFF3 and BED files) for a
cluster of three members of the NPC2 family in the genome of the spider
P. tepidariorum. The automatic annotation of this region using
MAKER2 (track Ptep_v0.5.3-Models) generated a chimeric gene model (two
different genes are fused), which could be easily curated using BITACORA.
Additionally, although TBLASTN searches detected a putative novel exon in
the gene encoding NPC2_5, GeMoMa did not include this sequence in the
final gene model due to the presence of an in-frame stop codon. In order
to decide if this stop codon is an annotation, assembly or sequencing
artifact, it would be necessary, for instance, to verify if the exon
exists in other species, if that region is transcribed, or if the gene
is under selective constraints.
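As a first step before such checks, in-frame stop codons in a candidate coding sequence can be located with a few lines of Python. This is only an illustrative sketch: the function name `in_frame_stops` and the example sequences are hypothetical, the standard genetic code is assumed, and a stop codon in the last complete codon is treated as the legitimate terminal stop.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def in_frame_stops(cds: str, frame: int = 0) -> list:
    """Return 0-based nucleotide positions of stop codons in the given
    reading frame, excluding a stop in the last complete codon."""
    seq = cds.upper()
    positions = [
        i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS
    ]
    # Start position of the last complete codon in this frame.
    last_codon_start = frame + 3 * ((len(seq) - frame) // 3 - 1)
    return [p for p in positions if p != last_codon_start]


# Hypothetical sequences, for illustration only.
interrupted = "ATGAAATAGGGGTAA"  # ATG AAA TAG GGG TAA: premature TAG
intact = "ATGAAAGGGTAA"          # ATG AAA GGG TAA: only the terminal stop

interrupted_stops = in_frame_stops(interrupted)
intact_stops = in_frame_stops(intact)
```

A non-empty result only flags the position of the interruption; deciding whether it reflects a pseudogene or an annotation, assembly or sequencing artifact still requires the additional evidence discussed above.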