Identifying new genomic regions encoding gene family members
In the second step, BITACORA uses TBLASTN to search the genome sequences
for regions encoding homologs of the proteins included in the uFPDB but
not annotated in the uGFF. BITACORA implements two different approaches
for generating novel gene models from TBLASTN results (set with the
“gemoma” parameter). On the one hand, BITACORA implements the GeMoMa
tool, a homology-based gene prediction program that uses amino acid
sequence and intron position conservation to reconstruct gene models
from BLAST hits (Keilwagen, Hartung, & Grau, 2019; Keilwagen, Hartung,
Paulini, Twardziok, & Grau, 2018; Keilwagen et al., 2016). The second
approach is based on a “close proximity” strategy. Under this
strategy, all independent TBLASTN hits (i.e., after merging all
alignments that overlap in TBLASTN results) located on the same scaffold
and separated by less than a predetermined distance (set with the
“intron distance” parameter) are connected to form a single gene
model. This step is intended to join all coding exons of the same gene
based on the average intron length in the focal genome. We provide
scripts to estimate this average length from the input GFF (see
Supplementary Material).
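
The “close proximity” strategy can be illustrated with a minimal Python
sketch. It is not BITACORA's own code: the Hit type, the function names
and the max_intron argument (standing in for the “intron distance”
parameter) are hypothetical, and the sketch assumes 1-based inclusive
coordinates while ignoring strand and reading frame for brevity.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """A TBLASTN alignment on the genome (hypothetical minimal record)."""
    scaffold: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def merge_overlapping(hits):
    """Collapse overlapping alignments on each scaffold into independent hits."""
    by_scaffold = defaultdict(list)
    for hit in hits:
        by_scaffold[hit.scaffold].append(hit)

    merged = []
    for scaffold, group in by_scaffold.items():
        group.sort(key=lambda h: (h.start, h.end))
        current = group[0]
        for hit in group[1:]:
            if hit.start <= current.end:
                # Overlap: extend the current independent hit.
                current = Hit(scaffold, current.start, max(current.end, hit.end))
            else:
                merged.append(current)
                current = hit
        merged.append(current)
    return merged


def chain_by_proximity(independent_hits, max_intron):
    """Group hits on the same scaffold whose gap is shorter than max_intron.

    Each returned list of hits stands for the putative exons of one gene model.
    """
    models = []
    for hit in sorted(independent_hits, key=lambda h: (h.scaffold, h.start)):
        last = models[-1][-1] if models else None
        if (last is not None
                and last.scaffold == hit.scaffold
                and hit.start - last.end < max_intron):
            models[-1].append(hit)
        else:
            models.append([hit])
    return models


hits = [
    Hit("s1", 100, 200), Hit("s1", 150, 300), Hit("s1", 900, 1000),
    Hit("s1", 20000, 20100), Hit("s2", 50, 120),
]
gene_models = chain_by_proximity(merge_overlapping(hits), max_intron=1000)
```

In this toy input, the two overlapping alignments on s1 are first merged
into one independent hit, which is then joined with the nearby hit at
900-1000; the distant hit on s1 and the hit on s2 each form their own
model.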
Finally, to avoid reporting inaccurate gene models caused by artifactual
gene fusions in dense gene clusters or by any other errors
(regardless of which of the two approaches has been applied),
BITACORA checks for the presence of the gene family-specific protein
domain (using the HMM profile in FPDB) and reports in the curated
dataset only those gene models that contain the domain. In addition, all
proteins are tagged with a label that indicates the number of different
domains in the sequence (Ndom). This final filtering step can be relaxed
using the BITACORA “genomicblastp” option, which evaluates the presence
of positive hits in either HMMER or BLASTP searches against the
proteins in FPDB (see Supplementary Material for details).
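
The strict and relaxed filters described above can be summarized in a
short Python sketch. This is only an illustration of the decision rule
as stated in the text, not BITACORA's implementation: the GeneModel
record, the relaxed flag (standing in for the “genomicblastp” option)
and the label format are hypothetical, and the domain counts and BLASTP
results are assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical summary of one predicted gene model."""
    name: str
    n_domains: int        # family-specific domains found by the HMM search
    has_blastp_hit: bool  # positive BLASTP hit against the family proteins


def passes_filter(model, relaxed=False):
    """Strict mode requires the domain; relaxed mode also accepts a BLASTP hit."""
    if model.n_domains > 0:
        return True
    return relaxed and model.has_blastp_hit


def curate(models, relaxed=False):
    """Return labels of retained models, tagged with their domain count (Ndom)."""
    # The "<name>_<N>dom" label format is an assumption made for illustration.
    return [f"{m.name}_{m.n_domains}dom" for m in models if passes_filter(m, relaxed)]


models = [
    GeneModel("g1", 2, True),
    GeneModel("g2", 0, True),
    GeneModel("g3", 0, False),
]
strict_set = curate(models)
relaxed_set = curate(models, relaxed=True)
```

Under these assumptions, only g1 survives the strict filter, whereas the
relaxed filter additionally retains g2, which lacks a detectable domain
but has a BLASTP hit.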