Low-coverage whole-genome sequencing (WGS) and data analysis
To test whether phylogenomic data can be extracted from low-coverage WGS data for organisms with large genomes (> 1 Gb), we selected two colubrid species (Amphiesma stolatum and Heterodon platirhinos) as test samples. We sequenced their DNA libraries on an Illumina HiSeq X-ten lane in paired-end 150-bp mode and obtained ~40 Gb of sequence data per sample, corresponding to a sequencing depth of about 20×. The genome sizes of the two colubrid species were estimated from the WGS data using Jellyfish version 2.3.0 (Guillaume & Carl 2011). For comparison, we also downloaded from NCBI the WGS data of four insects with relatively small genomes: Pediculus humanus (108 Mb), Phoebis sennae (287 Mb), Zootermopsis nevadensis (485 Mb), and Halyomorpha halys (996 Mb). The WGS data resources for these four insect species are given in Appendix S2.
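The relationship between data volume, genome size, and sequencing depth used above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the ~2 Gb genome size in the usage note is implied by the reported ~40 Gb at about 20×, not a value taken from the Jellyfish estimates.

```python
import math


def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Average depth = total sequenced bases / genome size (both in bases)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def read_pairs_for_depth(target_depth: float, genome_size: float,
                         read_length: int = 150) -> int:
    """Read pairs needed to reach target_depth with paired-end reads.

    Each pair contributes 2 * read_length bases; the result is rounded up.
    """
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * genome_size / (2 * read_length))
```

For example, `sequencing_depth(40e9, 2e9)` gives 20.0, and `read_pairs_for_depth(1, 2e9)` gives the approximate number of 150-bp read pairs that would correspond to 1× depth for a genome of that assumed size.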
We adopted the method of Zhang et al. (2019) to extract phylogenetic loci directly from the WGS data through de novo genome assembly. The raw reads were first filtered to remove adapter sequences and low-quality nucleotides. The filtered reads of each species were then assembled into scaffolds with the SPAdes version 3.8.1 genome assembler in auto k-mer mode (--cov-cutoff auto). We downloaded a vertebrate core database comprising 2,586 genes (total length ~3,280 kb) and an insect core database comprising 1,367 genes (total length ~1,285 kb) from the OrthoDB database as target gene clusters, and used BUSCO v3.0.2 (Waterhouse et al., 2018) to extract orthologous sequences from the genome scaffolds. The genome assembly and gene extraction steps were repeated at sequencing depths of 1×, 5×, 10×, and 20×. To compare the efficiency of extracting phylogenomic loci from low-coverage WGS data between small-genome and large-genome species, we calculated the proportion of complete and fragmented genes (gene recovery rates) and the proportion of sites recovered relative to the total length of the reference genes.
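As a minimal sketch of the two summary statistics described above, the following Python code computes a gene recovery rate and a site proportion from per-assembly counts. The function and field names are hypothetical, and the counts are assumed to have been tallied beforehand from the gene extraction output; the numbers in the usage note are illustrative placeholders, not results from this study.

```python
from dataclasses import dataclass


@dataclass
class RecoverySummary:
    complete_rate: float      # complete genes / reference genes
    fragmented_rate: float    # fragmented genes / reference genes
    recovery_rate: float      # (complete + fragmented) / reference genes
    site_proportion: float    # recovered sites / total reference length


def summarize_recovery(n_complete: int, n_fragmented: int, n_reference: int,
                       recovered_sites: int,
                       reference_sites: int) -> RecoverySummary:
    """Summarize gene recovery for one assembly at one sequencing depth."""
    if n_reference <= 0 or reference_sites <= 0:
        raise ValueError("reference totals must be positive")
    if n_complete + n_fragmented > n_reference:
        raise ValueError("recovered genes exceed reference gene count")
    return RecoverySummary(
        complete_rate=n_complete / n_reference,
        fragmented_rate=n_fragmented / n_reference,
        recovery_rate=(n_complete + n_fragmented) / n_reference,
        site_proportion=recovered_sites / reference_sites,
    )
```

With the vertebrate reference set (2,586 genes, ~3,280,000 sites), a call such as `summarize_recovery(2000, 300, 2586, 2_460_000, 3_280_000)` would report a recovery rate of 2,300/2,586 and a site proportion of 0.75. Running the same function for each species at each depth produces values that can be compared directly between the small-genome and large-genome groups.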