2.4.3 Identification of biological function of outlier genes
A gene ontology (GO) set
enrichment analysis (GSEA) was performed to associate biological and
common gene functions to the outlier SNPs and outlier 5 kb windows. This
analysis tests if multiple genes can be linked with certain GO
functions. First, functional annotation of the brown trout reference was
performed on the EggNOG v5.0 web-interface
(http://eggnog-mapper.embl.de/;
Huerta-Cepas et al., 2018) using brown trout protein FASTA files from
Ensembl (https://www.ensembl.org/Salmo_trutta/Info/Index). Second,
we converted the NCBI S. trutta annotation release 100 from GFF
to bed format using bedtools v2.27.1
(Quinlan & Hall, 2010), including only
CDS regions, and extracted all genes that overlapped with the identified
outlier SNPs and windows using bedtool’s
(Quinlan & Hall, 2010) ‘intersect’
function. Third, the R package topGO (Alexa
& Rahnenfuhrer, 2020) was used to test for over-representation of GO
biological processes using a node size of 5 and the ‘classic’ algorithm
to account for structural relationships among GO terms. Finally, GO
terms with p-values ≤0.01 (i.e., enriched GO terms) were retained and
then filtered in REVIGO
(Supek,
Bošnjak, Škunca, & Šmuc, 2011) to remove redundant GO terms. Treemaps
were drawn in R (Supek et al., 2011) to
visualize enriched and non-redundant GO superclusters.
2.4.4 Locating LDH-A genes in the brown trout reference
assembly The sympatric populations of Lakes Bunnersjöarna were identified with a
single allozyme locus, LDH-1 , that showed contrasting
homozygosity in the demes. LDH-1 is one of two loci (LDH-1and LDH-2 ) in salmonids coding for LDH-A
(Allendorf et al., 1984), and we
identified the DNA sequences relating to this enzyme. First, all genes
related to LDH (EC number 1.1.1.27, GO:0004457) were identified using
the GO term protein annotations from EggNOG (section 2.4.3.). Second, to
ensure no gene copy was missed, we used TBLASTN
(Altschul et al., 1990) with the FASTA
sequence for LDH extracted from the Uniprot database to search the
genome assembly, extracting hits with E- values<0.0001 and
bitscore>60. Finally, each LDH-A copy was manually
inspected for exon boundaries, reading frame and mapping performance by
loading the brown trout reference assembly and the bam files generated
with the Pool-seq and individual whole-genome sequencing pipelines (see
below for the latter) into the Integrative Genomics Viewer (IGV) v2.4.2
(Robinson et al., 2011). In both cases,
the sequencing reads had only been filtered for minimum base quality 20
and for being mapped as pairs, i.e. without depth or additional quality
filtering, to ensure that reads mapping to the LDH-A loci had not
been removed. The amino acid compositions for all LDH genes were
extracted using an in-house script (Appendix S4).
We calculated allele frequencies for the LDH-A gene copies in
both demes from the Pool-seq data (gene-based sync files; section
2.4.1.) using ‘snp-frequency-diff.pl’ from Popoolation2
(Kofler, Pandey, et al., 2011). Sequence
variants were called for these genes on 14 of the individual WGS data
using bcftools call (Danecek, McCarthy, &
Li, 2015), inspected using IGV, and visualized with Gviz
(Hahne & Ivanek, 2016) and ggplot2 in R
(Wickham, 2016).
2.5 Individual whole-genome
sequencing data processing and variant
calling
Sequenced reads from two individuals per deme were aligned against the
brown trout reference assembly using BWA mem v0.7.17
(BWA; Li & Durbin, 2009), and sorted
using SAMtools v1.8 (Li et al., 2009). No
data was filtered before mapping sequences against the reference genome
(unlike the pipeline for Pool-seq data). Alignment generated two bam
files for each individual (one per lane), which were merged per
individual using PICARD v2.10.3 MarkDuplicates
(https://broadinstitute.github.io/picard/). PCR duplicates were
also marked using PICARD v2.10.3. The quality of the obtained bam files
was assessed with Qualimap v2.2.1
(García-Alcalde et al., 2012).
Individual genomic variant call format files (gVCFs) were generated with
HaplotypeCaller from the Genome Analysis ToolKit (GATK) v3.8
(McKenna et al., 2010), and joint
genotyping of all brown trout samples (including fish from other lakes,
see 2.6) was performed with GATK GenotypeGVCFs. To filter out low
quality variants, we applied a hard filter approach, using GATK’s
VariantFiltration tool, separately for SNPs (QD < 2.0, MQ
< 40.0, FS > 10.0, MQRankSum < -5.0,
ReadPosRankSum < -5.0, SOR > 5.0) and indels (QD
< 2.0, FS > 10.0, ReadPosRankSum <
-5.0, SOR > 5.0). VCFtools v0.1.15
(Danecek et al., 2011) was used to retain
only bi-allelic SNPs with a minor allele frequency ≥ 0.01. Further,
variants from the un-assigned scaffolds were removed. We used BCFtools
v1.8 (Li et al., 2009) to annotate the ID
field of the joint VCF file.