BITACORA application example
To demonstrate the performance of BITACORA in annotating gene family
members in a group of genomes of different assembly quality, we present
an extended report of the results in Vizueta et al. (2018).
Specifically, we selected two of the arthropod chemosensory gene
families, insect gustatory receptors (GR) and Niemann-Pick type C2
(NPC2) proteins (Pelosi, Iovinella, Felicioli, & Dani, 2014; Robertson,
2015) in a subset of seven of the eleven chelicerate genomes surveyed in
this study (Table 1; Fig. 2). We selected these gene families because they
differ widely in number of members and protein length. Whereas GR is a
large gene family that encodes seven-transmembrane receptors about 400
amino acids long, the NPC2 family has few members and encodes shorter
proteins (about 150 amino acids on average); despite the difference in
length, both gene families have a similar average number of exons per
gene in the surveyed species. Furthermore, to validate the accuracy of
our software in gold standard annotated genomes, we also checked the
performance of BITACORA in identifying these members in the genome of
Drosophila melanogaster.
For the analysis, we retrieved genome sequences, annotations and
predicted peptides of D. melanogaster (r6.31, FlyBase; Adams et
al., 2000), the scorpions Centruroides sculpturatus (bark
scorpion, genome assembly version v1.0, annotation version v0.5.3; Human
Genome Sequencing Center (HGSC)) and Mesobuthus martensii (v1.0,
Scientific Data Sharing Platform Bioinformation (SDSPB)) (Cao et al.,
2013); and of the spiders Acanthoscurria geniculata (tarantula,
v1, NCBI Assembly, BGI) (Sanggaard et al., 2014), Stegodyphus
mimosarum (African social velvet spider, v1, NCBI Assembly, BGI)
(Sanggaard et al., 2014), Latrodectus hesperus (western black
widow, v1.0, HGSC), Parasteatoda tepidariorum (common house
spider, v1.0 Augustus 3, SpiderWeb and HGSC) (Schwager et al., 2017) and
Loxosceles reclusa (brown recluse, v1.0, HGSC).
In addition, for benchmarking purposes, we compared the
performance of BITACORA with Augustus-PPX, a method that also uses
protein profiles to improve automatic annotations of gene family members
(--proteinprofile; Keller et al., 2011; Stanke, Schöffmann,
Morgenstern, & Waack, 2006), in annotating GR and NPC2 copies in the
same seven chelicerate genomes. Strikingly, BITACORA identified
thousands of new gene models that had previously gone undetected in
chelicerates, even after applying Augustus-PPX (Table 1; see also
supplementary data in Vizueta et al. 2018 to find the BITACORA curated
sequences). For instance, in the bark scorpion Centruroides
sculpturatus, the automatic annotation pipelines show 24 GR-encoding
sequences, whereas BITACORA identified and annotated 1,234 genes
or gene fragments, compared with only 307 recovered with Augustus-PPX (Table
1; Supplementary table S1). Globally, BITACORA identified, annotated and
curated 3,570 sequences encoding GR proteins across the seven
chelicerate genomes (3,466 of which were absent in the available GFFs for
these species), while Augustus-PPX predicted only 1,638 gene models for
this family (Table 1; Supplementary table S1). It is well known that
this gene family evolves rapidly in arthropods, both in terms of
sequence change and repertoire size, so that the same genome encodes very
recent and distantly related receptors as well as pseudogenes. Since
some of these receptors show a very restricted gene expression pattern
(expressed in specialized cells and tissues involved in chemoreception),
their transcripts are often missing from RNA-seq data sets, which are one
of the sources of evidence used for the automatic annotation of genomes (Joseph &
Carlson, 2015; Robertson, 2015; Vizueta et al., 2017; Zhang, Zheng, Li,
& Fan, 2014). This fact, together with the high divergence exhibited by
many copies (old duplication events and/or rapid evolution), is
probably the cause of the low accuracy of both automatic annotation and
Augustus-PPX.
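To make the GR comparison above easier to read at a glance, the quoted counts can be turned into proportions and fold differences. The following is a minimal sketch that uses only the figures given in the text; the variable and function names are illustrative assumptions, not part of BITACORA, and the counts include gene fragments as well as complete genes.

```python
# Counts quoted in the text (GR family, seven chelicerate genomes).
# All names below are hypothetical and chosen for illustration only.
GR_BITACORA_TOTAL = 3570      # sequences identified, annotated and curated by BITACORA
GR_ABSENT_FROM_GFF = 3466     # of those, absent from the available GFFs
GR_AUGUSTUS_PPX_TOTAL = 1638  # gene models predicted by Augustus-PPX

# Centruroides sculpturatus GR counts per method.
CSCU_GR = {"automatic annotation": 24, "Augustus-PPX": 307, "BITACORA": 1234}


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def fold(numerator, denominator):
    """Return the fold difference numerator / denominator."""
    return numerator / denominator


novel_share = percent(GR_ABSENT_FROM_GFF, GR_BITACORA_TOTAL)
global_fold = fold(GR_BITACORA_TOTAL, GR_AUGUSTUS_PPX_TOTAL)
cscu_fold_vs_ppx = fold(CSCU_GR["BITACORA"], CSCU_GR["Augustus-PPX"])

print(f"GR sequences absent from the GFFs: {novel_share:.1f}%")
print(f"BITACORA vs Augustus-PPX (all genomes): {global_fold:.2f}-fold")
print(f"BITACORA vs Augustus-PPX (C. sculpturatus): {cscu_fold_vs_ppx:.2f}-fold")
```

These ratios compare raw counts only; they say nothing about how many of the Augustus-PPX models are true members of the family.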
The members of the NPC2 family, on the contrary, are much more conserved
at the sequence level and show higher levels of gene expression in
arthropods (Pelosi et al., 2014). As expected, the number of newly
identified copies is much lower than in the case of GRs. Even so,
BITACORA detected 44 novel NPC2-encoding sequences, raising
the total annotated repertoire in these species from 75 to 119 (Table
1). In this case, Augustus-PPX recovered 97 gene models for
this gene family, which improves on previous automatic
annotations but is still outperformed by BITACORA. Importantly,
Augustus-PPX predicted thousands of gene models that are not real
members of the focal gene family (Supplementary table S1), requiring
further actions to separate gene family copies from false allocations.
Finally, both methods correctly annotated all members of the GR and NPC2
families in the D. melanogaster genome, demonstrating the real
utility of these tools in the genome drafts of non-model organisms. It
is worth noting, however, that a non-negligible number of these newly
identified genes in chelicerate genomes are incomplete (about 40% and
63% of the GR and NPC2 members, respectively). This can be
partially explained by poor genome assembly quality (indicated by
the N50 and number of scaffolds), or by the low number of annotated
proteins in the input GFF. Although BITACORA can be useful with such
low-quality data, its performance in terms of complete gene models
will be compromised.
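Since assembly quality is summarized above by the N50 and the number of scaffolds, a brief reminder of how N50 is obtained may help. The sketch below uses the standard definition (the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly); the function name and the toy scaffold lengths are assumptions for illustration and are not taken from the surveyed genomes.

```python
def n50(scaffold_lengths):
    """Return the N50 of a list of scaffold lengths (standard definition).

    Scaffolds are sorted from longest to shortest and accumulated until
    they cover at least half of the total assembly length; the length of
    the last scaffold added is the N50.
    """
    if not scaffold_lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(scaffold_lengths)
    running = 0
    for length in sorted(scaffold_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer-safe check for "at least 50%"
            return length


# Toy example with made-up lengths (not real assembly data).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
print(f"scaffolds: {len(toy_lengths)}, N50: {toy_n50}")
```

A fragmented assembly has many short scaffolds and a low N50, so genes are more likely to be split across scaffolds, which is consistent with the high proportion of incomplete gene models reported above.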