2.5.3 Annotation of de novo transcriptome assembly
We used the pipeline available within the bioinformatics platform
OmicsBox 82,83 to annotate the de novo transcriptome as follows: i) we performed a BLAST search against the
non-redundant protein sequence database (nr v5) (blastx-fast; E-value
cutoff: 1e-05); ii) we retrieved gene ontology (GO)
terms for the sequences with blast hits using the gene_info and
gene2accession files from the NCBI database, and UniProt IDs using the
PSD, UniProt, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB databases;
iii) we annotated the sequences by assigning the most reliable and
specific GO terms according to their E-values (< 1e-06) and
sequence similarities (high scoring segment
pair hit coverage cutoff of 80%) as well as the quality of their
annotation using the evidence code for each GO term (1 for experimental
evidence, 0.7-0.8 for computational analysis evidence, and 0.5-0.9 for
all other evidence types) 84; iv) in parallel, we
searched for matches between our sequences and protein domains and
families within the InterPro protein databases and the EggNOG database
to annotate predicted orthologues within our query sequences 85; v) we merged the InterPro and EggNOG
classifications with the annotation resulting from step (iii).
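
The filtering and weighting logic of step (iii) can be illustrated with a minimal Python sketch. It is not the OmicsBox implementation: the `Hit` record, the function names, the specific evidence-code weights (chosen only to fall within the ranges stated above), the product of similarity and weight used as a score, and the reading of the 80% coverage cutoff as "at least 80%" are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from the text of step (iii).
E_VALUE_MAX = 1e-6        # hits must have an E-value below this
HSP_COVERAGE_MIN = 80.0   # percent; ">= 80" is an assumed reading of the cutoff

# Illustrative weights only: each value lies within the range quoted in the
# text, but the exact per-code values are assumptions, not OmicsBox settings.
EVIDENCE_WEIGHTS = {
    "EXP": 1.0,  # experimental evidence
    "IDA": 1.0,  # experimental evidence (direct assay)
    "ISS": 0.8,  # computational analysis evidence
    "RCA": 0.7,  # computational analysis evidence
    "IEA": 0.7,  # other evidence type (electronic annotation)
    "ND": 0.5,   # other evidence type (no data)
}
DEFAULT_WEIGHT = 0.5  # assumed fallback for codes not listed above


@dataclass
class Hit:
    """Hypothetical record for one GO term transferred from one BLAST hit."""
    go_term: str
    e_value: float
    hsp_coverage: float   # percent of the hit covered by the HSP
    similarity: float     # percent sequence similarity
    evidence_code: str


def passes_filters(hit: Hit) -> bool:
    """Keep a hit only if it satisfies both cutoffs from step (iii)."""
    return hit.e_value < E_VALUE_MAX and hit.hsp_coverage >= HSP_COVERAGE_MIN


def score_go_terms(hits: list[Hit]) -> dict[str, float]:
    """Return the best evidence-weighted similarity per GO term."""
    scores: dict[str, float] = {}
    for hit in hits:
        if not passes_filters(hit):
            continue
        weight = EVIDENCE_WEIGHTS.get(hit.evidence_code, DEFAULT_WEIGHT)
        score = hit.similarity * weight
        scores[hit.go_term] = max(scores.get(hit.go_term, 0.0), score)
    return scores


example_hits = [
    Hit("GO:0006355", 1e-30, 95.0, 90.0, "EXP"),
    Hit("GO:0006355", 1e-20, 85.0, 85.0, "IEA"),
    Hit("GO:0003677", 1e-04, 92.0, 88.0, "EXP"),  # fails the E-value cutoff
    Hit("GO:0005634", 1e-12, 60.0, 80.0, "ISS"),  # fails the coverage cutoff
]
example_scores = score_go_terms(example_hits)
```

In this sketch a GO term supported by several hits keeps its highest weighted score, so a well-supported experimental annotation outranks an electronically inferred one of similar sequence similarity.
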
Additionally, to classify epigenetic variants into different genomic
features, we annotated the de novo reference genome (obtained with the
epiGBS bioinformatics pipeline) in two ways: we used RepeatMasker v 4.0
to annotate transposons and repeats, using Embryophyta as the reference
species collection (v.4.0.6) 86, and DIAMOND v 0.8.22 to annotate
protein coding genes with the NCBI non-redundant protein sequences
database 87.
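
Once repeats and protein coding genes are annotated, assigning a variant to a genomic feature amounts to an interval overlap lookup. The sketch below shows one possible way to do this. The `Feature` record, the feature labels, the "unannotated" fallback, and the precedence order applied when features overlap are hypothetical and are not taken from the pipeline described here.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    """Hypothetical annotated interval on a reference contig."""
    contig: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    kind: str    # e.g. "gene", "transposon", "repeat"


# Assumed precedence when a position overlaps more than one feature type.
PRIORITY = ("gene", "transposon", "repeat")


def classify_position(contig: str, pos: int, features: list[Feature]) -> str:
    """Return the highest-priority feature type overlapping a position."""
    kinds = {
        f.kind
        for f in features
        if f.contig == contig and f.start <= pos <= f.end
    }
    for kind in PRIORITY:
        if kind in kinds:
            return kind
    return "unannotated"


example_features = [
    Feature("contig_1", 10, 120, "gene"),         # e.g. from a DIAMOND hit
    Feature("contig_1", 100, 150, "transposon"),  # e.g. from RepeatMasker
    Feature("contig_2", 5, 40, "repeat"),
]
example_classes = [
    classify_position("contig_1", 110, example_features),
    classify_position("contig_1", 130, example_features),
    classify_position("contig_2", 20, example_features),
    classify_position("contig_2", 90, example_features),
]
```

A linear scan is adequate for short reference contigs. For larger annotations an interval tree or sorted-interval search would be the usual replacement.
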