Discussion
Gene families are one of the most abundant and dynamic components of eukaryotic genomes. Therefore, having curated genomic data is fundamental not only to carry out comprehensive comparative or functional genomics studies on gene families, but also to understand global genome architecture and biology. During the last decades, the rapid development of sequencing technologies has enabled the accumulation of a large number of genome sequences from non-model organisms. These projects, which often address very specific molecular ecology questions or form part of large comparative genomics analyses, typically rely on automatic annotation pipelines, and very little effort is devoted to curating these annotations (see Sánchez-Herrero et al., 2019; and references therein). The proteins predicted by automatic annotation tools often contain systematic errors, such as incomplete or chimeric gene models, which are especially notable in gene families given the repetitive nature of their members. Moreover, since new copies commonly arise by unequal crossing-over, they are frequently found in physically close tandem arrays of similar sequences, further complicating annotation (Clifton et al., 2017; Vieira et al., 2007).
With this in mind, we have developed a bioinformatics tool that helps researchers to access these automatic annotations, extract the information on focal gene families, curate and update gene models, and identify new copies from DNA sequences. With BITACORA, gene family annotations can be substantially improved using both HMM profiles and iterative searches that incorporate the new variability found in previous searches. We validated our tool by comparing its performance with that of Augustus-PPX, a method developed to improve the annotation of gene family members matching a protein profile (Keller et al., 2011b; Stanke et al., 2006). In the two examples shown here, BITACORA not only outperformed the annotations of Augustus-PPX but also proved to be more accurate in its predictions.
The estimation of gene gains and losses, and the associated analyses of birth and death rates, are very sensitive to the quality of genome annotations. The example of the GR family in chelicerates demonstrates the importance of refining annotations using BITACORA. Indeed, directly using unsupervised annotations from low-quality draft genomes of non-model organisms to estimate turnover rates might produce highly erroneous results, not only in terms of gene counts but also because the estimates are biased towards highly expressed and/or very recent copies. BITACORA can therefore be used to considerably reduce these errors and to make more accurate and robust inferences about the age and origin of the family and about its mode of evolution.
On the other hand, the curation of both existing and newly identified members of a family with BITACORA might also be crucial for further analyses of their sequence evolution. The quality of multiple sequence alignments, which are used to determine orthology groups, to obtain divergence estimates or to detect the footprint of natural selection in gene family members, is strongly compromised by the presence of badly annotated copies, including chimeras and incorrectly annotated fragments. Using BITACORA, we can detect these artifacts and either fix them or discard them from further analyses.
Despite its proven utility, we are aware that BITACORA does not provide perfect annotations for a gene family. The GeMoMa algorithm is more sensitive than the close-proximity method and generates more accurate gene models. However, in the presence of assembly errors or highly fragmented genomes, GeMoMa might fail to identify some genes, especially putative pseudogenes. In such situations, the close-proximity method could help to detect these genes and report them in the final output.
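This section does not spell out how the close-proximity method works. As a rough illustration only, the sketch below assumes that it joins neighboring TBLASTN hits on the same scaffold and strand whenever the gap between them is no larger than a maximum intron-sized distance. The `Hit` type, the `group_hits` function and the `max_gap` default are hypothetical names and values chosen for this example; they are not taken from BITACORA.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One TBLASTN hit on the genome (hypothetical minimal record)."""
    scaffold: str
    strand: str
    start: int
    end: int


def group_hits(hits: List[Hit], max_gap: int = 15000) -> List[List[Hit]]:
    """Group hits lying close together on the same scaffold and strand.

    max_gap is an assumed maximum distance (in bp) between consecutive hits
    of one candidate gene; the default is an arbitrary illustrative value.
    """
    groups: List[List[Hit]] = []
    for hit in sorted(hits, key=lambda h: (h.scaffold, h.strand, h.start)):
        last = groups[-1] if groups else None
        if (
            last
            and last[-1].scaffold == hit.scaffold
            and last[-1].strand == hit.strand
            and hit.start - max(h.end for h in last) <= max_gap
        ):
            last.append(hit)      # close enough: same candidate gene
        else:
            groups.append([hit])  # too far, or different scaffold/strand
    return groups


example_hits = [
    Hit("scaffold_1", "+", 100, 200),
    Hit("scaffold_1", "+", 500, 600),
    Hit("scaffold_1", "+", 50000, 50100),
    Hit("scaffold_1", "-", 550, 650),
]
candidate_genes = group_hits(example_hits, max_gap=1000)
```

Because a grouping of this kind depends only on hit coordinates, it can still report a locus when a gene model builder rejects it, which is consistent with its use here as a fallback for fragmented assemblies and putative pseudogenes.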
Furthermore, to overcome putative gene model errors, BITACORA implements some filtering steps to determine whether the predicted coding sequences are correct. The program carries out a HMMER search to identify the protein family domain in all newly annotated sequences. If the HMMER search is negative, BITACORA can relax this step by checking whether the novel genes show significant BLASTP hits in a search against FPDB proteins. In this case, the sensitivity of the annotations will increase at the expense of specificity (i.e., it could generate false assignments to the focal family in the presence of repetitive regions or FPDB contamination, for instance). It is important to note that BITACORA generates homology-based predictions that could require different levels of experimental validation depending on the nature of the downstream analyses.
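The two-tier filter described above can be summarized as a small decision rule. The following is a minimal sketch, not BITACORA's actual code: the `Evidence` container, the `classify` function, the `allow_blastp_rescue` switch and the E-value cutoff of 1e-5 are all assumptions introduced for illustration, and the E-values are assumed to have been parsed beforehand from HMMER and BLASTP output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Best search results for one newly annotated protein (hypothetical)."""
    hmmer_evalue: Optional[float]   # best family-domain E-value, None if no hit
    blastp_evalue: Optional[float]  # best BLASTP E-value vs FPDB, None if no hit


def classify(ev: Evidence, evalue_cutoff: float = 1e-5,
             allow_blastp_rescue: bool = False) -> str:
    """Return 'domain', 'blastp_only' or 'rejected' for one candidate."""
    # Strict step: the protein family domain must be detected.
    if ev.hmmer_evalue is not None and ev.hmmer_evalue <= evalue_cutoff:
        return "domain"
    # Relaxed step: accept a significant BLASTP hit against FPDB proteins.
    # This raises sensitivity but can admit false positives.
    if (allow_blastp_rescue and ev.blastp_evalue is not None
            and ev.blastp_evalue <= evalue_cutoff):
        return "blastp_only"
    return "rejected"


strict = classify(Evidence(None, 1e-30))
relaxed = classify(Evidence(None, 1e-30), allow_blastp_rescue=True)
```

Keeping the label `blastp_only` separate from `domain` makes the sensitivity versus specificity trade-off visible downstream, so that candidates accepted only through the relaxed step can be prioritized for manual inspection.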
Notwithstanding these filtering steps, BITACORA produces output that can be read directly by genome annotation editors, such as Apollo, which makes it easier for researchers to improve gene models. Fig. 3 shows an example of the annotation tracks generated by BITACORA (GFF3 and BED files) for a cluster of three members of the NPC2 family in the genome of the spider P. tepidariorum. The automatic annotation of this region using MAKER2 (track Ptep_v0.5.3-Models) generated a chimeric gene model (two different genes are fused), which could be easily curated using BITACORA. Additionally, although TBLASTN searches detected a putative novel exon in the gene encoding NPC2_5, GeMoMa did not include this sequence in the final gene model due to the presence of an in-frame stop codon. To decide whether this stop codon is an annotation, assembly or sequencing artifact, it would be necessary, for instance, to verify whether the exon exists in other species, whether that region is transcribed, or whether the gene is under selective constraints.
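As a first check before any of these validations, the position of a premature stop codon in a candidate coding sequence can be located with a few lines of code. The sketch below assumes the standard genetic code and a coding sequence already oriented 5' to 3' on the coding strand; the function name `in_frame_stops` and the toy sequence are hypothetical and are not part of BITACORA or GeMoMa.

```python
from typing import List

# Stop codons of the standard genetic code (assumption: nuclear genes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(cds: str, frame: int = 0) -> List[int]:
    """Return 0-based start positions of premature in-frame stop codons.

    A stop codon occupying the last complete codon of the sequence is
    treated as the normal terminator and is not reported.
    """
    seq = cds.upper()
    n_codons = (len(seq) - frame) // 3
    last_start = frame + 3 * (n_codons - 1)  # start of the final codon
    return [
        i
        for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in STOP_CODONS and i != last_start
    ]


# Toy sequence: ATG AAA TAG CCC TAA (premature TAG, terminal TAA)
premature = in_frame_stops("ATGAAATAGCCCTAA")
```

A reported position only flags where the reading frame is interrupted; it cannot tell an assembly or sequencing error from a genuine pseudogenizing mutation, which is why the additional evidence listed above remains necessary.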