2.3 Codon based test of selection
Multiple consensus sequences of coding sequences for all samples were extracted using bam2consensus function from BamBam v1.4 (Borowiec, 2016), allowing a minimum read coverage per site of 4X. BamBam uses the individual bam files that were mapped to the G. calmariensisdraft de-novo genome, and extracts consensus for each CDS region based upon the genome annotation. We then assessed summary statistics of the consensus sequences using AMAS v1.0 (Borowiec, 2016). Only CDS regions with the length >300 and low proportions of missing values (<10%) were kept for downstream analysis (N =11,368).
To limit our analysis to orthologous loci among our three species, we assessed orthology among G. calmariensis protein sets and three other Coleoptera species (Asian long-horned beetle [Anoplophora glabripennis ], red flour beetle [Tribolium castaneum ] and mountain pine beetle [Dendroctonus ponderosae ]) using OrthoVenn2 (Xu et al., 2019) with default settings. A total of 4,591 single-copy-orthologs (SCO) were identified in G. calmariensis . Then, we identified these SCOs from the previously identified high quality consensus sequences, which resulted in 4,154 SCOs for downstream analysis.
To detect adaptive evolution, we used HDMKPRF (Zhao et al., 2019), which is an extension of MKPRF to analyse selection across multiple species (Bustamante, Wakeley, Sawyer, & Hartl, 2001; Sawyer & Hartl, 1992). A comprehensive justification for using HDMKPRF to detect genes under selection can be found in Okamura et al (2022), but a main advantage is the higher power for detecting weak and moderate selection. Compared with the classical MK-test, the higher power of both MKPRF and HDMKPRF is a consequence of adopting a Poisson random-field framework (Sawyer & Hartl, 1992) where per gene selection intensities are estimated using a Bayesian approach that combine information across multiple loci and derive posterior distributions. The meaning of selection intensities is adopted from Bustamante et al. (2001) and is akin to a neutrality index measure (Hahn, 2019). The HDMKPRF improves the MKPRF-test by simultaneously analysing polymorphism and divergence across multiple species, which allow the test to determine in which lineage that selection has occurred (Zhao et al., 2019). During this process, HDMKPRF additionally estimates population genetic parameters for each species, such as effective population sizes and mutation rates, which it uses to account for lineage specific differences in these parameters in its estimation of selection intensity per locus (for details see Zhao et al., 2019). The script (Table S1) for performing the HDMKPRF to derive selection intensities was kindly provided by Zhao et al. The input data for the analysis included all 4,154 SCOs and the analysis was implemented by first running 200,000 burn-in steps and thereafter posterior parameter distributions were estimated from 400,000 steps in a Markov chain Monte Carlo process with a thinning interval of 5, based on author recommendations. A gene was considered to be under positive selection when the 95% posterior credibility interval for the selection intensity was >0, and similarly under negative selection when the interval was <0. Because estimates of positive selection using population resequencing data are usually biased downward by the segregation of slightly deleterious mutations and some singleton errors (Bierne & Eyre-Walker, 2004; Fay, Wyckoff, & Wu, 2002), we removed singleton polymorphisms from all gene sets using a custom script (Sattath, Elyashiv, Kolodny, Rinott, & Sella, 2011) and then tested the effect on the power of detecting adaptive evolution by comparing HDMKPRF to gene sets before and after singleton removal.
We performed a gene ontology (GO) enrichment test of biological process terms of genes under selection using the BIOCONDUCTOR package topGO (Alexa & Rahnefuhrer, 2010). The background genes included the 4,154 genes used in HDMKPRF and those genes that were significantly (positively or negatively) selected were tested against this background for enrichment of biological process terms (FDR<10%) using the parent–child algorithm (Grossmann, Bauer, Robinson, & Vingron, 2007). To ensure a robust result, we only analysed GO terms with at least 5 members (node size=5), and we used EggNOG v5.0 (Huerta-Cepas et al., 2019) before the analysis to assign Gene Ontology (GO) terms to the predicted protein sets using Insecta as taxonomic scope to restrict the functional inferences to an insect-related scale. Afterwards, enriched GO terms were clustered to representative functional subsets using the REVIGO Drosophila database (Supek, Bošnjak, Škunca, & Smuc, 2011).