Ab initio gene prediction
The BRAKER2 pipeline (Camacho et al., 2009; Hoff, Lange, Lomsadze, Borodovsky, & Stanke, 2016; Hoff, Lomsadze, Borodovsky, & Stanke, 2019; Lomsadze, Burns, & Borodovsky, 2014; Stanke, Schöffmann, Morgenstern, & Waack, 2006; Stanke, Diekhans, Baertsch, & Haussler, 2008) was employed for gene prediction. First, repetitive sequences in the genome identified by RepeatMasker were soft-masked. To generate extrinsic evidence for gene prediction, eleven sets of RNA-seq reads (Table S3) were mapped to the genome sequence using HiSAT2 (Kim, Langmead, & Salzberg, 2015). The resulting BAM files were supplied to BRAKER2 via the ‘--bam’ option. In parallel, we assembled the RNA-seq reads using the Trinity assembler (Haas et al., 2013). The tr2aacds.pl program bundled in the EvidentialGene suite (http://arthropods.eugenes.org/EvidentialGene/evigene/) was then used to merge the assemblies from the multiple transcriptome data sets. The merged transcriptome assemblies were aligned to the genome sequence using PASA (Haas et al., 2008; Haas et al., 2013) to identify exon regions. In addition to tr2aacds.pl, StringTie (Pertea et al., 2015) was used to merge the multiple transcriptome data sets for exon region prediction. Furthermore, manually annotated amino acid sequences of S. ricini deposited in the Universal Protein Resource database (UniProt, http://www.uniprot.org) (Bateman, 2019) were aligned to the genome sequence using exonerate v2.2.0 (Slater & Birney, 2005) to obtain spliced protein alignment information. Finally, the predictions generated by BRAKER2, PASA, StringTie and exonerate were integrated using EvidenceModeler (Haas et al., 2008).
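The read-mapping and BRAKER2 steps described above could be scripted roughly as follows. This is a minimal sketch, not the commands actually run in this study: the file names, the index prefix, the paired-end read layout, and the thread and core counts are all hypothetical, and the helper functions only assemble command lines without executing them.

```python
import shlex
from typing import List, Sequence


def hisat2_command(index_prefix: str, reads_1: str, reads_2: str,
                   sam_out: str, threads: int = 4) -> List[str]:
    """Build a HISAT2 command for one RNA-seq library.

    Assumes paired-end reads (an assumption; the text does not state the layout).
    """
    return [
        "hisat2",
        "-p", str(threads),   # number of alignment threads
        "-x", index_prefix,   # index built beforehand with hisat2-build
        "-1", reads_1,
        "-2", reads_2,
        "-S", sam_out,        # HISAT2 writes SAM output
    ]


def samtools_sort_command(sam_in: str, bam_out: str) -> List[str]:
    """Build a samtools command that converts SAM to a coordinate-sorted BAM."""
    return ["samtools", "sort", "-o", bam_out, sam_in]


def braker_command(genome_fasta: str, bam_files: Sequence[str],
                   cores: int = 8) -> List[str]:
    """Build a BRAKER command for a soft-masked genome plus RNA-seq BAM files."""
    if not bam_files:
        raise ValueError("at least one BAM file is required")
    return [
        "braker.pl",
        f"--genome={genome_fasta}",
        f"--bam={','.join(bam_files)}",  # multiple BAM files are comma-separated
        "--softmasking",                 # genome repeats are soft-masked
        f"--cores={cores}",
    ]


if __name__ == "__main__":
    # Hypothetical library names standing in for the eleven RNA-seq sets.
    libraries = ["lib01", "lib02"]
    bams = []
    for lib in libraries:
        sam, bam = f"{lib}.sam", f"{lib}.sorted.bam"
        align = hisat2_command("genome_index", f"{lib}_1.fq.gz", f"{lib}_2.fq.gz", sam)
        sort = samtools_sort_command(sam, bam)
        print(" ".join(shlex.quote(arg) for arg in align))
        print(" ".join(shlex.quote(arg) for arg in sort))
        bams.append(bam)
    print(" ".join(shlex.quote(arg) for arg in braker_command("genome.softmasked.fa", bams)))
```

Building the commands as argument lists keeps each step inspectable before anything is run; the lists can be passed to `subprocess.run` once the paths have been checked.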