2.3 Codon based test of selection
Multiple consensus sequences of coding sequences for all samples were
extracted using bam2consensus function from BamBam v1.4 (Borowiec,
2016), allowing a minimum read coverage per site of 4X. BamBam uses the
individual bam files that were mapped to the G. calmariensisdraft de-novo genome, and extracts consensus for each CDS region
based upon the genome annotation. We then assessed summary statistics of
the consensus sequences using AMAS v1.0 (Borowiec, 2016). Only CDS
regions with the length >300 and low proportions of missing
values (<10%) were kept for downstream analysis
(N =11,368).
To limit our analysis to orthologous loci among our three species, we
assessed orthology among G. calmariensis protein sets and three
other Coleoptera species (Asian long-horned beetle [Anoplophora
glabripennis ], red flour beetle [Tribolium castaneum ] and
mountain pine beetle [Dendroctonus ponderosae ]) using
OrthoVenn2 (Xu et al., 2019) with default settings. A total of 4,591
single-copy-orthologs (SCO) were identified in G. calmariensis .
Then, we identified these SCOs from the previously identified high
quality consensus sequences, which resulted in 4,154 SCOs for downstream
analysis.
To detect adaptive evolution, we used HDMKPRF (Zhao et al., 2019), which
is an extension of MKPRF to analyse selection across multiple species
(Bustamante, Wakeley, Sawyer, & Hartl, 2001; Sawyer & Hartl, 1992). A
comprehensive justification for using HDMKPRF to detect genes under
selection can be found in Okamura et al (2022), but a main advantage is
the higher power for detecting weak and moderate selection. Compared
with the classical MK-test, the higher power of both MKPRF and HDMKPRF
is a consequence of adopting a Poisson random-field framework (Sawyer &
Hartl, 1992) where per gene selection intensities are estimated using a
Bayesian approach that combine information across multiple loci and
derive posterior distributions. The meaning of selection intensities is
adopted from Bustamante et al. (2001) and is akin to a neutrality index
measure (Hahn, 2019). The HDMKPRF improves the MKPRF-test by
simultaneously analysing polymorphism and divergence across multiple
species, which allow the test to determine in which lineage that
selection has occurred (Zhao et al., 2019). During this process, HDMKPRF
additionally estimates population genetic parameters for each species,
such as effective population sizes and mutation rates, which it uses to
account for lineage specific differences in these parameters in its
estimation of selection intensity per locus (for details see Zhao et
al., 2019). The script (Table S1) for performing the HDMKPRF to derive
selection intensities was kindly provided by Zhao et al. The input data
for the analysis included all 4,154 SCOs and the analysis was
implemented by first running 200,000 burn-in steps and thereafter
posterior parameter distributions were estimated from 400,000 steps in a
Markov chain Monte Carlo process with a thinning interval of 5, based on
author recommendations. A gene was considered to be under positive
selection when the 95% posterior credibility interval for the selection
intensity was >0, and similarly under negative selection
when the interval was <0. Because estimates of positive
selection using population resequencing data are usually biased downward
by the segregation of slightly deleterious mutations and some singleton
errors (Bierne & Eyre-Walker, 2004; Fay, Wyckoff, & Wu, 2002), we
removed singleton polymorphisms from all gene sets using a custom script
(Sattath, Elyashiv, Kolodny, Rinott, & Sella, 2011) and then tested the
effect on the power of detecting adaptive evolution by comparing HDMKPRF
to gene sets before and after singleton removal.
We performed a gene ontology (GO) enrichment test of biological process
terms of genes under selection using the BIOCONDUCTOR package topGO
(Alexa & Rahnefuhrer, 2010). The background genes included the 4,154
genes used in HDMKPRF and those genes that were significantly
(positively or negatively) selected were tested against this background
for enrichment of biological process terms (FDR<10%) using
the parent–child algorithm (Grossmann, Bauer, Robinson, & Vingron,
2007). To ensure a robust result, we only analysed GO terms with at
least 5 members (node size=5), and we used EggNOG v5.0 (Huerta-Cepas et
al., 2019) before the analysis to assign Gene Ontology (GO) terms to the
predicted protein sets using Insecta as taxonomic scope to restrict the
functional inferences to an insect-related scale. Afterwards, enriched
GO terms were clustered to representative functional subsets using the
REVIGO Drosophila database (Supek, Bošnjak, Škunca, & Smuc,
2011).