Generating and assembling genomic data
To obtain a reference genome sequence (of expected size approx. 110 Mbp), one P. serrata specimen (98c_Pro_SE) was sequenced by Oxford Nanopore technology (ONT) on one MinION II flowcell, producing 16.7 Gbp of data (3.3 million reads of average size of 5 kbp). Library preparation and sequencing were performed at the Roy J. Carver Biotechnology Center (University of Illinois at Urbana-Champaign, USA). The ONT reads were trimmed with filtlong (v0.2.0; https://github.com/rrwick/Filtlong) to retain sequences of at least 4000 bp and phred score of at least 20. Flye assembler (v2.5; Kolmogorov, Yuan, Lin, & Pevzner, 2019) was used to perform de novo genome assembly using ONT reads (with/without quality trimming) with the expected genome size of 110 Mbp as a parameter. The resulting assemblies were polished once with Racon (v1.3.3; https://github.com/isovic/racon) and twice with Medaka (v0.6.5; https://nanoporetech.github.io/medaka/) using the ONT reads. The final polishing was performed again with Racon, this time using Illumina reads. Quality checks of the final polished assemblies were performed with BUSCO (v3; Waterhouse et al., 2017). Biological origin of the assembled contigs was verified by blastx (Altschul et al., 1997) of the polished assembly (98 contigs) against the Pediculus humanus corporis genome (GCA_000006295.1 JCVI_LOUSE_1.0). The contigs for which the blastx did not yield matches with P. humanus corporis genes were further analysed by blastx against nr/nt NCBI database. By this procedure, we identified non-louse origin in 68 contigs (0,46% of the assembly length), which were therefore excluded from the following analyses.
Whole genome re-sequencing was performed to generate metagenomic data used to map the SNPs, and to assemble genomes of the L. polyplacis symbionts and mitochondrial minichromosomes. gDNA libraries for 26 louse specimens were constructed for paired-end Illumina sequencing with the NovaSeq6000 instrument. All samples were sequenced on one Illumina Novaseq lane producing on average 59.5 million 150 bp paired-end reads (PE) per sample, resulting in coverage 18.9x - 34x for the louse contigs (see results). Libraries were constructed with an average insert size of 450bp. Fastq files were generated from the sequence data using Casava v.1.8.2 or bcltofastq v.1.8.4 with Illumina 1.9 quality score encoding. All sequencing and fastq file generation were carried out at the W. M. Keck Center (University of Illinois, Urbana, IL, USA). Illumina reads of the re-sequenced specimens were checked for quality and filtered in BBtools (https://jgi.doe.gov/data-andtools/bbtools/), then assembled using the SPAdes assembler v 3.10 (Bankevich et al., 2012), under default settings with the parameter careful, decreasing the number of mismatches and indels.
To analyze genetic structure in the area of secondary contact of theApodemus -Polyplax populations (and the presumed hybrid zone of the parasite), we compared patterns obtained from three different sources: nuclear SNPs, mitochondrial genomes, and genomes of symbiotic bacterium L. polyplacis . Our hypothesis (visualized in the Figure1 c, d) predicts that for both maternally inherited markers (mitochondrial and symbiotic genomes), the inter-lineage divergence (i.e. SE vs. SW) will be much higher than any divergence within the lineages. This reflects the hypothetical scenario of two lineages isolated during their stay in two different refugia and also during most of the recolonization journey. On the contrary, since the hypothesis presumes hybridization, no such distinction should be found in nuclear SNPs across the presumed area of secondary contact; instead part of the samples will be expected to form an intermediate group representing hybrids. To visualize and quantify the diversification of maternally inherited markers, we reconstructed genealogical trees and calculated distance matrices for mitochondrial and symbiotic genomes. To visualize nuclear diversification using SNPs, we analyzed population structure using Principal Component Analysis (PCA) and Admixture software (Alexander et al., 2009). Lastly, we reconstructed genealogical relationships by building a maximum likelihood tree. Below, we provide details on processing the three sources of the data.