Generating and assembling genomic data
To obtain a reference genome sequence (of expected size approx. 110
Mbp), one P. serrata specimen (98c_Pro_SE) was sequenced by
Oxford Nanopore technology (ONT) on one MinION II flowcell, producing
16.7 Gbp of data (3.3 million reads of average size of 5 kbp). Library
preparation and sequencing were performed at the Roy J. Carver
Biotechnology Center (University of Illinois at Urbana-Champaign, USA).
The ONT reads were trimmed with filtlong (v0.2.0;
https://github.com/rrwick/Filtlong) to retain sequences of at least 4000
bp and phred score of at least 20. Flye assembler (v2.5; Kolmogorov,
Yuan, Lin, & Pevzner, 2019) was used to perform de novo genome assembly
using ONT reads (with/without quality trimming) with the expected genome
size of 110 Mbp as a parameter. The resulting assemblies were polished
once with Racon (v1.3.3; https://github.com/isovic/racon) and twice with
Medaka (v0.6.5; https://nanoporetech.github.io/medaka/) using the ONT
reads. The final polishing was performed again with Racon, this time
using Illumina reads. Quality checks of the final polished assemblies
were performed with BUSCO (v3; Waterhouse et al., 2017). Biological
origin of the assembled contigs was verified by blastx (Altschul et al.,
1997) of the polished assembly (98 contigs) against the Pediculus
humanus corporis genome (GCA_000006295.1 JCVI_LOUSE_1.0). The
contigs for which the blastx did not yield matches with P. humanus
corporis genes were further analysed by blastx against nr/nt NCBI
database. By this procedure, we identified non-louse origin in 68
contigs (0,46% of the assembly length), which were therefore excluded
from the following analyses.
Whole genome re-sequencing was performed to generate metagenomic data
used to map the SNPs, and to assemble genomes of the L.
polyplacis symbionts and mitochondrial minichromosomes. gDNA libraries
for 26 louse specimens were constructed for paired-end Illumina
sequencing with the NovaSeq6000 instrument. All samples were sequenced
on one Illumina Novaseq lane producing on average 59.5 million 150 bp
paired-end reads (PE) per sample, resulting in coverage 18.9x - 34x for
the louse contigs (see results). Libraries were constructed with an
average insert size of 450bp. Fastq files were generated from the
sequence data using Casava v.1.8.2 or bcltofastq v.1.8.4 with Illumina
1.9 quality score encoding. All sequencing and fastq file generation
were carried out at the W. M. Keck Center (University of Illinois,
Urbana, IL, USA). Illumina reads of the re-sequenced specimens were
checked for quality and filtered in BBtools
(https://jgi.doe.gov/data-andtools/bbtools/), then assembled using the
SPAdes assembler v 3.10 (Bankevich et al., 2012), under default settings
with the parameter careful, decreasing the number of mismatches and
indels.
To analyze genetic structure in the area of secondary contact of theApodemus -Polyplax populations (and the presumed hybrid
zone of the parasite), we compared patterns obtained from three
different sources: nuclear SNPs, mitochondrial genomes, and genomes of
symbiotic bacterium L. polyplacis . Our hypothesis (visualized in
the Figure1 c, d) predicts that for both maternally inherited markers
(mitochondrial and symbiotic genomes), the inter-lineage divergence
(i.e. SE vs. SW) will be much higher than any divergence within the
lineages. This reflects the hypothetical scenario of two lineages
isolated during their stay in two different refugia and also during most
of the recolonization journey. On the contrary, since the hypothesis
presumes hybridization, no such distinction should be found in nuclear
SNPs across the presumed area of secondary contact; instead part of the
samples will be expected to form an intermediate group representing
hybrids. To visualize and quantify the diversification of maternally
inherited markers, we reconstructed genealogical trees and calculated
distance matrices for mitochondrial and symbiotic genomes. To visualize
nuclear diversification using SNPs, we analyzed population structure
using Principal Component Analysis (PCA) and Admixture software
(Alexander et al., 2009). Lastly, we reconstructed genealogical
relationships by building a maximum likelihood tree. Below, we provide
details on processing the three sources of the data.