Low-coverage whole-genome sequencing (WGS) and data analysis
To test whether phylogenomic data can be extracted from low-coverage
WGS data for organisms with large genomes (> 1 Gb), we selected two
colubrid species (Amphiesma stolatum and Heterodon platirhinos) as
test samples. We sequenced their DNA libraries on an Illumina HiSeq
X Ten lane in paired-end 150-bp mode. We obtained ~40 Gb of sequence
data per sample, corresponding to a sequencing depth of about 20×. The genome
sizes of the two colubrid species were estimated from the WGS data
using Jellyfish version 2.3.0 (Marçais & Kingsford, 2011). For comparison,
we also downloaded the WGS data of four insects with relatively small
genomes from NCBI: Pediculus humanus (108 Mb), Phoebis sennae (287 Mb),
Zootermopsis nevadensis (485 Mb), and Halyomorpha halys (996 Mb). The
WGS data sources for these four insect species are given in Appendix S2.
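
The relation between data volume, genome size, and depth used above can be illustrated with a short sketch. It assumes a two-column k-mer histogram (multiplicity, count) such as the one written by `jellyfish histo`, and it uses a common peak-based estimate (total k-mers divided by the multiplicity of the main peak). The text does not state how the estimate was derived from the Jellyfish output, so this procedure, the function names, and the 2 Gb genome size in the usage note are illustrative assumptions only.

```python
def parse_histogram(text):
    """Parse 'multiplicity count' lines into sorted (int, int) pairs.

    Assumes a whitespace-separated two-column layout; other lines are skipped.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((int(fields[0]), int(fields[1])))
    return sorted(pairs)


def estimate_genome_size(histogram):
    """Return (estimated genome size, main-peak multiplicity).

    Low-multiplicity k-mers are mostly sequencing errors, so everything
    before the first local minimum of the histogram is discarded. The
    genome size is then approximated as the number of remaining k-mer
    occurrences divided by the multiplicity of the main peak.
    """
    hist = sorted(histogram)
    valley = 0
    for i in range(1, len(hist)):
        if hist[i][1] > hist[i - 1][1]:
            valley = i - 1  # counts start rising again: error tail ends here
            break
    usable = hist[valley:]
    peak_multiplicity = max(usable, key=lambda pair: pair[1])[0]
    total_kmers = sum(mult * count for mult, count in usable)
    return total_kmers / peak_multiplicity, peak_multiplicity


def sequencing_depth(total_bases, genome_size):
    """Average depth = sequenced bases / genome size."""
    return total_bases / genome_size
```

For example, `sequencing_depth(40e9, 2e9)` gives 20.0, which matches the ~40 Gb and ~20× reported here if the genome is about 2 Gb (an illustrative value, not one stated in the text).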
We adopted the method of Zhang et al. (2019) to extract phylogenetic
loci directly from the WGS data through de novo genome
assembly. The raw reads were first filtered to remove adapter
sequences and low-quality nucleotides. The filtered reads of each
species were assembled into scaffolds with the SPAdes genome assembler
version 3.8.1, with automatic k-mer selection and the coverage cutoff
set to auto (--cov-cutoff auto). We
downloaded a vertebrate core database comprising 2,586 genes (a total
length of ~3,280 K) and an insect core database composed
of 1,367 genes (a total length of ~1,285 K) from the
OrthoDB database as targeted gene clusters and used BUSCO v3.0.2
(Waterhouse et al., 2018) to extract orthologous sequences from the
genome scaffolds. The genome assembly and gene extraction steps were
repeated at sequencing depths of 1×, 5×, 10×, and 20×. To compare the
effectiveness of extracting phylogenomic loci from low-coverage WGS
data between small- and large-genome species, we calculated the
proportion of complete or fragmented genes (gene recovery rate) and the
proportion of sites recovered relative to the total length of the
reference genes.
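
The two summary statistics can be sketched as follows. This is a minimal illustration, not the authors' script. It assumes that the lower depths were obtained by randomly subsampling the 20× read set (the text does not say how they were generated), that gene statuses come from a tab-separated BUSCO full table whose first two columns are the gene ID and its status, and that genes marked Duplicated count as recovered alongside Complete and Fragmented ones. All function names and the `recovered_lengths` / `reference_lengths` dictionaries are hypothetical.

```python
from collections import Counter


def subsample_fractions(full_depth, target_depths):
    """Fraction of reads to keep so the full data set is reduced to each target depth."""
    return {depth: min(1.0, depth / full_depth) for depth in target_depths}


def parse_busco_table(text):
    """Map each gene ID to its status from a tab-separated full table.

    Comment lines start with '#'. A duplicated gene occupies several
    rows, so later rows simply overwrite the same dictionary key.
    """
    status_by_gene = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        status_by_gene[fields[0]] = fields[1]
    return status_by_gene


def gene_recovery_rate(status_by_gene):
    """Proportion of reference genes recovered as complete or fragmented."""
    counts = Counter(status_by_gene.values())
    recovered = counts["Complete"] + counts["Duplicated"] + counts["Fragmented"]
    return recovered / len(status_by_gene)


def site_recovery_rate(recovered_lengths, reference_lengths):
    """Proportion of reference sites covered by the extracted sequences.

    Each gene's recovered length is capped at its reference length so
    that an over-long hit cannot push the proportion above 1.
    """
    total_reference = sum(reference_lengths.values())
    total_recovered = sum(
        min(recovered_lengths.get(gene, 0), length)
        for gene, length in reference_lengths.items()
    )
    return total_recovered / total_reference
```

Under these assumptions, `subsample_fractions(20, [1, 5, 10, 20])` gives the read fractions needed to reach the four depths from a 20× data set, and the two rate functions can be applied to each resulting assembly in turn.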