Results
The performance of low-coverage WGS sequencing in species with large genomes
Our WGS sequencing produced 261 and 269 million clean 150-bp paired-end reads for the two colubrid species (Amphiesma stolatum andHeterodon platirhinos ), respectively (~40 G sequence data per sample). The genome sizes of Amphiesma stolatumand Heterodon platirhinos estimated by Jellyfish are 1,348 Mbp and 1,686 Mbp, respectively. We compared the performance of the method of Zhang et al. (2019) for extracting phylogenetic loci from low-coverage WGS data in these two colubrid species with large genomes and other four insects with relatively small genomes (Fig. 3). Detailed statistics of genome assembly and locus extraction are summarized in Appendix S3.
At levels of sequencing depth 5× and less, few genes can be extracted from genome assemblies of both small and large genome species (Fig. 3). At a level of 10× sequencing depth, the recovery rate of genes (complete + fragmented) for small genome species were significantly improved, with a range of 68%~95%. If the recovered rate was calculated by DNA sites, 41%~91% of DNA sites were recovered from small genome species. In contrast, the gene recovery rates of the two snake species were still low at 10× sequencing depth, less than 10%, and no more than 5% DNA sites were recovered from the two snake species (Fig. 3). When sequencing depth increased to 20×, the gene recovery rate of small genome species increased to 85%~98%, and the site recovery rate increased to 63%~99%, which is consistent with that reported by Zhang et al. (2019). In contrast, at a level of 20× sequencing depth, the gene recovery rate of the two snake species improved to 42%~59%, but only less than 35% DNA sites were recovered because most of the recovered genes were not complete (Fig. 3). These results suggest that extracting phylogenomic data from low-coverage WGS data in large genome species is much more difficult than in small genome species.
cDNA sequencing and reference ORF and UTR sets
After data quality control, we obtained a total of 45 million clean 150-bp paired-end reads (~ 6.8 G data) from cDNA sequencing. We assembled these reads using TRINITY and obtained a total of 31,841 cDNA sequences. After filtering by redundancy, sequencing depth (≥ 5×), and length (≥ 200 bp), a total of 26,409 cDNA sequences were retained. Among these sequences, 14,785 contained ORFs with a length greater than 300 bp. Of the protein sequences of these predicted ORFs, 8,429 have orthologous human proteins. Among these 8,429 cDNA sequences, 4000 (47.6%) have both 5’ and 3’ untranslated regions (UTRs), 3909 (46.3%) contain 3’ UTR, 273 (3.2%) contain 5’ UTR and 247 (2.9%) only contain coding sequences. From these 8,429 cDNA sequences, we extracted a total number of 8,429 ORFs and 10,665 UTRs (3391 of 5’ UTR and 7274 of 3’ UTR) as reference sets. The length of reference ORFs ranged from 300 bp to 14,325 bp, with an average length of 1,225 bp; the length of the reference UTRs ranged from 100 bp to 7,694 bp, with an average length of 716 bp. The total length of reference ORFs and UTRs are ~10,300 K and ~7,600 K, respectively.
FLc-Capture sequencing results
For the 24 colubrid snake and 12 outgroup snake samples, we obtained a total of 2,577 million quality-filtered 150-bp paired-end reads (~387 G data) using the FLc-Capture sequencing, ~10.7 G data per sample (range: 5.5 G - 20.3 G). The numbers of assembled contigs for different samples ranged from 158,424 to 872,008. These contigs were searched against the reference ORF and UTR sets to generate orthologous ORF and UTR sequences for each sample (see Methods). On average, for the 8,429 ORF and 10,665 UTR targets, we could obtain about 6,000 ORF and 5,900 UTR sequences from the ingroup species, and about 5,000 ORF and 4,600 UTR sequences from the outgroup species that are more distantly related to the probe species (Table 2). When the recovery rate is calculated by genes, the recovery rates of ORF (67%) are generally higher than those of UTR (51%) (Fig. 4a). However, when the recovery rate is calculated by nucleotides, the recovery rates of UTR (44%) are, on the contrary, higher than those of ORF (32%) (Fig. 4b), indicating that the recovered sequences of UTRs are more integrated than those of ORFs across reference sequences.
We explored capture specificity, the percentage of reads that can be aligned to target sequences (on-target) across all samples in our experiment (Table 2). The average on-target value of UTRs of all samples (23%) is higher than that of ORFs (12%) (Table 2), indicating that UTR sequences are more easily captured, possibly because they were physically continuous in genomes. However, the fluctuation of the on-target value of UTRs among samples (square deviation = 6.19, n = 36) is higher than that of ORFs (square deviation = 3.44, n = 36), suggesting that the capture efficiency of UTR sequences may be more sensitive to genetic distance. We found that the capture specificity of both ORF and UTR from a sample is negatively related to its genetic distance to the probe species, and the regression slope of UTR (coefficient = -204.39) is much smaller than that of ORF (coefficient = -64.04) (Fig. 5a). These results showed that the capture specificity of UTR decreases more rapidly than that of ORF with the increase of genetic distance from the probe species, in line with that noncoding sequences (UTR) evolve more rapidly than coding sequences (ORF).
In our experiment, we did not perform normalization treatment for our full-length cDNA probes to reduce the abundance of highly expressed transcripts. We thus want to know whether our capture experiment will dominate by those highly abundant cDNA probes. We found that, to some extent, the mean capture depth of each target does appear to relate to the abundance of its probe (Fig. 5b). However, the relationship between the capture depth of one target and the abundance of its probe is not absolute because there are also many ORF and UTR sequences with high capture depth (> 1000) while their corresponding probes are not abundant (< 100) (Fig. 5b). These results indicated that unnormalized cDNA probes do not significantly affect the ability of our method to sequence capture those target sequences corresponding to low-expressed transcripts.
Detailed statistics of the sequencing, contig assembly, gene recovery, nucleotide recovery, and capture specificity for each sample are summarized in Table 2. On the whole, the FLc-Capture method can simultaneously obtain thousands of coding and noncoding sequences from both the ingroup and outgroup samples, which indicates that our experimental design using homemade full-length cDNA probes to capture both of genomic coding and noncoding regions is successful.