Nuclear SNPs
Reads from 26 louse genomes were individually mapped with Bowtie2 (Langmead & Salzberg, 2012) to the reference genome described above. Before mapping, the reference sequence was indexed using Samtools (Li et al., 2009) and a dictionary file was made with CreateSequenceDictionary in Picard v.2.0.1 (https://broadinstitute.github.io/picard/). After mapping with Bowtie2, resulting SAM files were sorted to the BAM files and indexed using Samtools. Duplicated sequences were removed from sorted BAM files with Picard v.2.0.1 and quality of mapping was verified with QualiMap (http://qualimap.bioinfo.cipf.es/). SNPs for population-level analysis were called using the GATK Genome Analysis Toolkit following the “Best Practices” guide from the Broad Institute (Van der Auwera et al., 2013). SNPs were jointly called for all 26P. serrata samples and filtered with QD (quality by depth) <2.0, FS (Fisher strand test) >60.0, MQ (mapping quality) <40.0, and MQRankSum (mapping quality rank sum test) <–12.5. The SNPs from GATK were filtered with minor allele frequency (MAF) threshold 0.05 in PLINK 1.9 (https://www.cog-genomics.org/plink/1.9/). SNPs in linkage disequilibrium (LD) were pruned with the squared coefficient of correlation (r2) 0.2 and missing data threshold 0.2.
Resulting 47,595 variants passed filters and quality control in 26P. serrata samples and used in population structure estimation. PCA was performed in R package SNPRelate (DOI: 10.18129/B9.bioc.SNPRelate) and phylogenetic tree based on Maximum likelihood analysis was reconstructed using SNPhylo pipeline (Lee, Guo, Wang, Kim, & Paterson, 2014) with 1000 bootstraps. Ten independent runs of admixture analysis were performed to assess the proportion of individual ancestry with different numbers of hypothetical populations, using the ADMIXTURE software v.1.3.0 (Alexander et al., 2009). An unsupervised model approach based on maximum likelihood was applied to the genotype matrix for 1-10 populations (K=1-10). Most probable estimate of K was calculated with cross-validation procedure. Summarization and graphical output of independent runs for each K was obtained with CLUMPAK (Kopelman et al., 2015). Based on the obtained results samples were divided into three groups (SW, SE, and hybrids), and their diversity measures (heterozygosity and pairwise FST) were estimated in VCFtools (Danecek et al., 2011) and SNPRelate, respectively..
Legionella polyplacis genomes reconstruction and comparison
The contigs corresponding to Legionella polyplacis were visually identified using ORF prediction done in the Geneious package (the prokaryotic gene arrangement could be readily recognized by the density of predicted ORFs). In most samples, the genome of L. polyplaciswas assembled into a single complete contig. In three assemblies (DPH41,19JA1_SW, 46MAN_SW) the symbiont genome was highly fragmented or could not be found, despite the good assembly quality of the louse genome (these samples were removed from the L. polyplacisanalysis). Complete L. polyplacis genomes were annotated in RAST (Aziz et al., 2008) and aligned using Mafft algorithm implemented in Geneious. In three samples (Ne125b_Kot_SW, 99b_Pro_SE, 98c_Pro_SE) the rRNA region did not assemble correctly and was only represented by short fragments. To extend these fragments into a full length, we used the program aTRAM 2.0 (Allen, LaFrance, Folk, Johnson, & Guralnick, 2018). Phylogenetic analysis of the resulting 530,063 bp long matrix was performed in Phyml (Guindon et al., 2010). The evolutionary model GTR+I+G was determined for the whole concatenated matrix (considering the strong phylogenetic signal in the matrix, the model selection is a purely formalistic step in the presented analysis) by a corrected Akaike information criterion in jModelTest2 (Darriba et al., 2012) based on the AIC. A comparison of gene content was done manually using the annotated alignment. We were taking into consideration the following criteria: presence of the annotations across all genomes; presence of the corresponding sequence (in case of missing annotation); distribution of differences (i.e. does the difference reflect the SE/SW split).