2.5.3 Annotation of de novo transcriptome assembly
We annotated the de novo transcriptome with the pipeline available within the bioinformatics platform OmicsBox 82,83, as follows: i) we performed a BLAST search against the non-redundant protein sequence database (nr v5) (blastx-fast; E-value cutoff: 1e-05); ii) we retrieved gene ontology (GO) terms for the sequences with BLAST hits using the gene_info and gene2accession files from the NCBI database, and UniProt IDs using the PSD, UniProt, Swiss-Prot, TrEMBL, RefSeq, GenPept and PDB databases; iii) we annotated the sequences by assigning the most reliable and specific GO terms according to their E-values (< 1e-06), their sequence similarities (high-scoring segment pair hit coverage cutoff of 80%) and the quality of their annotation, as given by the evidence code of each GO term (1 for experimental evidence, 0.7-0.8 for computational analysis evidence, and 0.5-0.9 for all other evidence types) 84; iv) in parallel, we searched for matches between our sequences and protein domains and families within the InterPro protein databases and the EggNOG database to annotate predicted orthologues within our query sequences 85; v) we merged the InterPro and EggNOG classifications with the annotation resulting from step (iii).
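The filtering logic of step (iii) can be illustrated with a minimal sketch. It is not the OmicsBox implementation, which applies a more elaborate annotation rule. It only shows how an E-value threshold, an HSP coverage cutoff and evidence-code weights could be combined. The `GoHit` record, the `select_go_terms` function and the individual weights in `EVIDENCE_WEIGHTS` are hypothetical. The weights were picked from within the ranges quoted above and are not the platform's actual settings.

```python
from dataclasses import dataclass

# Hypothetical evidence-code weights, chosen from within the ranges quoted in
# the text (1 for experimental, 0.7-0.8 for computational, 0.5-0.9 otherwise).
EVIDENCE_WEIGHTS = {
    "EXP": 1.0,  # inferred from experiment
    "IDA": 1.0,  # inferred from direct assay
    "ISS": 0.8,  # inferred from sequence similarity (computational)
    "RCA": 0.7,  # reviewed computational analysis
    "TAS": 0.9,  # traceable author statement (other)
    "IEA": 0.7,  # inferred from electronic annotation (other)
    "ND": 0.5,   # no biological data available (other)
}
DEFAULT_WEIGHT = 0.5  # assumption: unknown codes get the lowest quoted weight


@dataclass(frozen=True)
class GoHit:
    """One GO term transferred to a query sequence from one BLAST hit."""
    query_id: str
    go_term: str
    evalue: float
    hsp_coverage: float  # percent of the hit covered by the HSP
    evidence_code: str


def select_go_terms(hits, max_evalue=1e-6, min_hsp_coverage=80.0):
    """Return {(query_id, go_term): best evidence weight} for hits that pass
    the E-value (strictly below max_evalue) and HSP coverage filters."""
    best = {}
    for hit in hits:
        if hit.evalue >= max_evalue or hit.hsp_coverage < min_hsp_coverage:
            continue  # hit is not reliable enough to transfer its GO term
        weight = EVIDENCE_WEIGHTS.get(hit.evidence_code, DEFAULT_WEIGHT)
        key = (hit.query_id, hit.go_term)
        if weight > best.get(key, 0.0):
            best[key] = weight  # keep the best-supported evidence per term
    return best
```

In this sketch each query sequence keeps every GO term that passes both filters, together with the strongest evidence weight observed for it. Ranking terms by specificity in the GO hierarchy, which the platform also takes into account, is left out.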
Additionally, in order to classify epigenetic variants into different genomic features, we annotated the de novo reference genome (obtained with the epiGBS bioinformatics pipeline) in two ways: we used RepeatMasker v 4.0 to annotate transposons and repeats, with Embryophyta as the reference species collection (v. 4.0.6) 86, and DIAMOND v 0.8.22 to annotate protein-coding genes against the NCBI non-redundant protein sequences database 87.
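As an illustration of how such annotations can be used to assign variants to genomic features, the sketch below labels a position by every annotated interval that overlaps it. It is a simplified assumption about the classification step, not the procedure used in the study. The function `classify_position`, the tuple layout of `features`, the feature names and the "unannotated" fallback label are all hypothetical. Coordinates are treated as 1-based and inclusive.

```python
def classify_position(contig, pos, features):
    """Return the sorted feature types overlapping a position.

    features: iterable of (contig, start, end, feature_type) tuples with
    1-based, inclusive coordinates (hypothetical layout for this sketch).
    Positions overlapping no feature are labelled "unannotated".
    """
    labels = {
        feature_type
        for (feature_contig, start, end, feature_type) in features
        if feature_contig == contig and start <= pos <= end
    }
    return sorted(labels) or ["unannotated"]


# Toy annotation: one repeat and one gene that partly overlap on contig "c1".
example_features = [
    ("c1", 1, 100, "repeat"),
    ("c1", 50, 200, "gene"),
]
example_labels = classify_position("c1", 75, example_features)
```

A linear scan like this is adequate for a toy example. For a full genome annotation, an interval tree or a sorted-interval lookup would be the more usual choice.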