Ab initio gene prediction
The BRAKER2 pipeline (Camacho et al., 2009; Hoff, Lange, Lomsadze,
Borodovsky, & Stanke, 2016; Hoff, Lomsadze, Borodovsky, & Stanke,
2019; Lomsadze, Burns, & Borodovsky, 2014; Stanke, Schöffmann,
Morgenstern, & Waack, 2006; Stanke, Diekhans, Baertsch, & Haussler,
2008) was used for gene prediction. First, repetitive
sequences in the genome identified by RepeatMasker were soft-masked. To
generate extrinsic evidence for gene prediction, eleven sets of RNA-seq
reads (Table S3) were mapped to the genome sequence using HISAT2 (Kim,
Langmead, & Salzberg, 2015). The resulting BAM files were
passed to BRAKER2 with the '--bam' option. In parallel,
we assembled the RNA-seq reads using the Trinity assembler (Haas et al.,
2013). The tr2aacds.pl program bundled in the EvidentialGene suite
(http://arthropods.eugenes.org/EvidentialGene/evigene/) was then used to
merge the assemblies from the multiple transcriptome data sets. The
merged transcriptome assemblies were aligned to the
genome sequence using PASA (Haas et al., 2008; Haas et al., 2013) to
identify exon regions. In addition to the tr2aacds.pl program,
StringTie (Pertea et al., 2015) was used to merge the multiple
transcriptome data sets for exon region prediction. Furthermore, amino acid
sequences of manually annotated S. ricini proteins deposited
in the Universal Protein Resource database (UniProt, http://www.uniprot.org) (Bateman, 2019) were aligned to the genome
sequence using exonerate v2.2.0 (Slater & Birney, 2005) to obtain
spliced protein alignments. Finally, the predictions
generated by BRAKER2, PASA, StringTie and exonerate were integrated using
EVidenceModeler (Haas et al., 2008).
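
For illustration, the masking, mapping and BRAKER2 steps described above can be sketched as a small Python helper that only assembles the command lines and does not run them. This is a minimal sketch under stated assumptions, not the exact invocation used in this study: the file names, sample names, index prefix, thread count and paired-end read layout are hypothetical, and apart from the '--bam' option none of the tool options shown are reported in the text. The helper name build_commands is likewise introduced here for illustration only.

```python
import shlex


def build_commands(genome, samples, threads=4):
    """Return argument lists for soft-masking, read mapping and BRAKER2.

    genome  : path to the unmasked genome FASTA (hypothetical name)
    samples : dict mapping a sample name to its (R1, R2) FASTQ paths;
              paired-end input is an assumption of this sketch
    threads : thread count passed to HISAT2 and samtools (assumption)
    """
    masked = f"{genome}.masked"   # RepeatMasker writes <input>.masked
    index = "genome_index"        # hypothetical HISAT2 index prefix
    commands = [
        # -xsmall reports repeats in lowercase, i.e. soft-masking
        ["RepeatMasker", "-xsmall", genome],
        ["hisat2-build", masked, index],
    ]
    bams = []
    for name, (reads_1, reads_2) in samples.items():
        sam, bam = f"{name}.sam", f"{name}.sorted.bam"
        # Map one RNA-seq set, then sort the alignments into a BAM file
        commands.append(["hisat2", "-p", str(threads), "-x", index,
                         "-1", reads_1, "-2", reads_2, "-S", sam])
        commands.append(["samtools", "sort", "-@", str(threads),
                         "-o", bam, sam])
        bams.append(bam)
    # All BAM files are handed to BRAKER2 as one comma-separated list
    commands.append(["braker.pl", f"--genome={masked}",
                     f"--bam={','.join(bams)}", "--softmasking"])
    return commands


if __name__ == "__main__":
    # Two hypothetical samples stand in for the eleven sets in Table S3
    example = {"sampleA": ("sampleA_1.fastq.gz", "sampleA_2.fastq.gz"),
               "sampleB": ("sampleB_1.fastq.gz", "sampleB_2.fastq.gz")}
    for cmd in build_commands("genome.fa", example):
        print(shlex.join(cmd))
```

Building the commands as argument lists rather than shell strings keeps each step inspectable before execution; the lists could be passed one at a time to subprocess.run if the tools are installed.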