Curating existing annotations
The BITACORA workflow has three main steps (Fig. 1). The first step
consists of identifying all putative homologs of the FPDB
sequences from the focal gene family that are already present in the
input GFF file and curating their gene models (hereinafter referred
to as b-curated (bitacora-curated) gene models or proteins).
Specifically, the pipeline launches BLASTP and HMMER searches (Altschul,
1997; Eddy, 2011) against the proteins predicted from the features in
the input GFF using the FPDB protein sequences and HMM profiles as
queries; the resulting alignments are filtered for quality (i.e. BLASTP
hits covering at least two-thirds of the length of the query sequence or
including at least 80% of the complete protein used as the subject
are retained). The results from both searches are combined into a single
integrated result for each protein (gene model). BITACORA then
trims the original models based on these combined results (retaining
only the aligned sequence) and reports the new gene coordinates (b-curated
models) in an updated GFF (uGFF), which fixes, for example, chimeric
annotations. In addition, the proteins encoded by these b-curated models are
added to the FPDB (updated FPDB or uFPDB) for use in an
additional search round.
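
The filtering and trimming logic of this first step can be illustrated with a minimal Python sketch. It is not BITACORA's actual code: the `Hit` record, the function names, the 1-based inclusive coordinates, and the choice to merge overlapping or adjacent aligned regions are all assumptions made for illustration. Only the two coverage thresholds (two-thirds of the query, 80% of the subject) come from the description above, and the HMMER regions are assumed to have been filtered beforehand.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one BLASTP alignment (1-based, inclusive)."""
    query_len: int    # length of the FPDB query protein
    subject_len: int  # length of the annotated protein used as subject
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def passes_filter(hit, min_query_cov=2 / 3, min_subject_cov=0.8):
    """Keep a hit covering >= 2/3 of the query OR >= 80% of the subject."""
    query_cov = (hit.q_end - hit.q_start + 1) / hit.query_len
    subject_cov = (hit.s_end - hit.s_start + 1) / hit.subject_len
    return query_cov >= min_query_cov or subject_cov >= min_subject_cov


def merge_intervals(intervals):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def trimmed_regions(blast_hits, hmmer_intervals):
    """Combine filtered BLASTP hits with HMMER regions on one protein.

    Returns the aligned regions (protein coordinates) that a trimmed
    gene model would retain. Assumption: hmmer_intervals are already
    quality-filtered (start, end) pairs on the same subject protein.
    """
    kept = [(h.s_start, h.s_end) for h in blast_hits if passes_filter(h)]
    return merge_intervals(kept + list(hmmer_intervals))


if __name__ == "__main__":
    good = Hit(query_len=300, subject_len=1000,
               q_start=1, q_end=250, s_start=100, s_end=349)
    weak = Hit(query_len=300, subject_len=1000,
               q_start=1, q_end=50, s_start=600, s_end=649)
    print(trimmed_regions([good, weak], [(340, 400)]))
```

Under these assumptions, an annotated protein that yields two well-separated retained regions would be reported as two intervals, which is one way a chimeric annotation could be split into separate models. Converting the retained protein intervals back to genomic coordinates for the uGFF is not shown.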