Optional search round and final output
Finally, BITACORA can also perform a second search round that takes as
input all proteins obtained in steps 1 and 2 (sFPDB database). This
additional step (step 3 in Fig 1) is especially useful
for detecting remote homologs missed in the first round. The final
BITACORA output includes: 1) an updated GFF file with both b-curated
and b-novel gene models; 2) all non-redundant proteins predicted from
these feature annotations (in a FASTA file); 3) two BED files, one with
the coordinates of all independent TBLASTN hits found in the genome
sequence and the other with only those hits that would encode novel
putative exons; and 4) all protein sequences found in all steps.
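To make the relation between TBLASTN hits and BED coordinates concrete, the following minimal sketch converts one hit into a BED record. It is not BITACORA's own code: the function name blast_hit_to_bed is hypothetical, the input is assumed to be in the standard 12-column BLAST tabular format (outfmt 6), and the BED score field is simply set to 0. The sketch does not merge overlapping hits, which a pipeline reporting independent hits would presumably need to do.

```python
def blast_hit_to_bed(line):
    """Convert one BLAST tabular (outfmt 6) line into a BED6 tuple.

    Assumes the default column order: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    fields = line.rstrip("\n").split("\t")
    query, subject = fields[0], fields[1]
    sstart, send = int(fields[8]), int(fields[9])

    # In TBLASTN output, a hit on the minus strand has sstart > send.
    strand = "+" if sstart <= send else "-"

    # BLAST coordinates are 1-based and inclusive; BED is 0-based, half-open.
    start = min(sstart, send) - 1
    end = max(sstart, send)

    # The score column is a placeholder (assumption of this sketch).
    return (subject, start, end, query, 0, strand)


if __name__ == "__main__":
    hit = "OR1\tscaffold_7\t45.2\t120\t60\t2\t1\t120\t5360\t5001\t1e-20\t95.1"
    print("\t".join(str(x) for x in blast_hit_to_bed(hit)))
```

The only subtle points are the shift from 1-based inclusive to 0-based half-open coordinates and the inference of the strand from the order of the subject start and end positions.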
Additional features
BITACORA can also be used in
the absence of either a reference genome for the target species (e.g.
for transcriptomic studies; Protein mode) or a precompiled GFF (e.g. for
non-annotated genomes; Genome mode); in these cases, the input should be
a FASTA file with the set of predicted proteins or the genome sequences,
respectively (see Supplementary Material for alternative usage modes).
With BITACORA, we also distribute a series of scripts to perform some
useful tasks, such as estimating intron length statistics from a GFF,
converting GFF to GTF format, and retrieving all protein sequences
encoded by the features of a GFF file. Furthermore, to better adjust to
the particularities of each genome, BITACORA allows the user to specify
the values of the most important parameters, such as the E-value
for BLAST and HMMER searches, the number of threads in BLAST runs, and
the algorithm used to build novel gene models from TBLASTN hits.
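As an illustration of one of the auxiliary tasks mentioned above, estimating intron lengths from a GFF, the following sketch infers introns as the gaps between consecutive CDS features that share a parent. It is not the script distributed with BITACORA: the function name intron_lengths is hypothetical, and the input is assumed to be GFF3 with tab-separated columns and a Parent attribute on each CDS feature.

```python
import statistics
from collections import defaultdict


def intron_lengths(gff_lines, feature="CDS"):
    """Return intron lengths inferred from gaps between consecutive
    features of the given type that share a Parent attribute."""
    segments = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        # Parse the attribute column (key=value pairs separated by ';').
        attrs = dict(
            item.strip().split("=", 1)
            for item in cols[8].split(";")
            if "=" in item
        )
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                segments[(cols[0], parent)].append((int(cols[3]), int(cols[4])))

    lengths = []
    for coords in segments.values():
        coords.sort()
        for (_, prev_end), (next_start, _) in zip(coords, coords[1:]):
            gap = next_start - prev_end - 1  # coordinates are 1-based, inclusive
            if gap > 0:
                lengths.append(gap)
    return lengths


if __name__ == "__main__":
    gff = [
        "scaffold_1\tsrc\tCDS\t101\t200\t.\t+\t0\tID=c1;Parent=mRNA1",
        "scaffold_1\tsrc\tCDS\t301\t400\t.\t+\t2\tID=c2;Parent=mRNA1",
        "scaffold_1\tsrc\tCDS\t1001\t1100\t.\t+\t0\tID=c3;Parent=mRNA1",
    ]
    lengths = intron_lengths(gff)
    print("n =", len(lengths))
    print("mean =", statistics.mean(lengths))
    print("median =", statistics.median(lengths))
    print("max =", max(lengths))
```

Summary statistics of this kind can help in choosing a maximum intron length when TBLASTN hits are joined into gene models, although the appropriate value depends on the genome under study.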