Results
The
performance of low-coverage WGS sequencing in species with large
genomes
Our
WGS sequencing produced 261 and 269 million clean 150-bp paired-end
reads for the two colubrid species (Amphiesma stolatum andHeterodon platirhinos ), respectively (~40 G
sequence data per sample). The genome sizes of Amphiesma stolatumand Heterodon platirhinos estimated by Jellyfish are 1,348 Mbp
and 1,686 Mbp, respectively. We
compared the performance of the method of Zhang et al. (2019) for
extracting phylogenetic loci from low-coverage WGS data in these two
colubrid species with large genomes and other four insects with
relatively small genomes (Fig. 3). Detailed statistics of genome
assembly and locus extraction are summarized in Appendix S3.
At
levels of sequencing depth 5× and less, few genes can be extracted from
genome assemblies of both small and large genome species (Fig. 3).
At a level of 10× sequencing depth,
the recovery rate of genes (complete + fragmented) for small genome
species were significantly improved, with a range of
68%~95%. If the recovered rate was calculated by DNA
sites, 41%~91% of DNA sites were recovered from small
genome species. In contrast, the gene recovery rates of the two snake
species were still low at 10× sequencing depth, less than 10%, and no
more than 5% DNA sites were recovered from the two snake species (Fig.
3). When sequencing depth increased
to 20×, the gene recovery rate of small genome species increased to
85%~98%, and the site recovery rate increased to
63%~99%, which is consistent with that reported by
Zhang et al. (2019). In contrast, at a level of 20× sequencing depth,
the gene recovery rate of the two snake species improved to
42%~59%, but only less than 35% DNA sites were
recovered because most of the recovered genes were not complete (Fig.
3). These results suggest that extracting phylogenomic data from
low-coverage WGS data in large genome species is much more difficult
than in small genome species.
cDNA sequencing and
reference ORF and UTR sets
After data quality control, we obtained a total of 45 million clean
150-bp paired-end reads (~ 6.8 G data) from cDNA
sequencing. We assembled these reads using TRINITY and obtained a total
of 31,841 cDNA sequences. After filtering by redundancy, sequencing
depth (≥ 5×), and length (≥ 200 bp), a total of 26,409 cDNA sequences
were retained. Among these sequences, 14,785 contained ORFs with a
length greater than 300 bp. Of the protein sequences of these predicted
ORFs, 8,429 have orthologous human proteins. Among these 8,429 cDNA
sequences, 4000 (47.6%) have both 5’ and 3’ untranslated regions
(UTRs), 3909 (46.3%) contain 3’ UTR, 273 (3.2%) contain 5’ UTR and 247
(2.9%) only contain coding sequences. From these 8,429 cDNA sequences,
we extracted a total number of 8,429 ORFs and 10,665 UTRs (3391 of 5’
UTR and 7274 of 3’ UTR) as reference sets. The length of reference ORFs
ranged from 300 bp to 14,325 bp, with an average length of 1,225 bp; the
length of the reference UTRs ranged from 100 bp to 7,694 bp, with an
average length of 716 bp. The total length of reference ORFs and UTRs
are ~10,300 K and ~7,600 K,
respectively.
FLc-Capture
sequencing results
For the 24 colubrid snake and 12 outgroup snake samples, we obtained a
total of 2,577 million quality-filtered 150-bp paired-end reads
(~387 G data) using the
FLc-Capture sequencing,
~10.7 G data per sample (range: 5.5 G - 20.3 G). The
numbers of assembled contigs for different samples ranged from 158,424
to 872,008. These contigs were searched against the reference ORF and
UTR sets to generate orthologous ORF and UTR sequences for each sample
(see Methods). On average, for the
8,429 ORF and 10,665 UTR targets, we could obtain about 6,000 ORF and
5,900 UTR sequences from the ingroup species, and about 5,000 ORF and
4,600 UTR sequences from the outgroup species that are more distantly
related to the probe species (Table 2).
When
the recovery rate is calculated by genes, the recovery rates of ORF
(67%) are generally higher than those of UTR (51%) (Fig. 4a).
However, when the recovery rate is
calculated by nucleotides, the recovery rates of UTR (44%) are, on the
contrary, higher than those of ORF (32%) (Fig. 4b), indicating that the
recovered sequences of UTRs are more integrated than those of ORFs
across reference sequences.
We explored capture specificity,
the percentage of reads that can be aligned to target sequences
(on-target) across all samples in our experiment (Table
2).
The
average on-target value of UTRs of all samples (23%) is higher than
that of ORFs (12%) (Table 2), indicating that UTR sequences are more
easily captured, possibly because they were physically continuous in
genomes.
However,
the fluctuation of the on-target value of UTRs among samples (square
deviation = 6.19, n = 36) is higher than that of ORFs (square deviation
= 3.44, n = 36), suggesting that the capture efficiency of UTR sequences
may be more sensitive to genetic distance. We found that the capture
specificity of both ORF and UTR from a sample is negatively related to
its genetic distance to the probe species, and the regression slope of
UTR (coefficient = -204.39) is much smaller than that of ORF
(coefficient = -64.04) (Fig. 5a). These results showed that the capture
specificity of UTR decreases more rapidly than that of ORF with the
increase of genetic distance from the probe species, in line with that
noncoding sequences (UTR) evolve more rapidly than coding sequences
(ORF).
In our experiment, we did not perform normalization treatment for our
full-length cDNA probes to reduce the abundance of highly expressed
transcripts.
We
thus want to know whether our capture experiment will dominate by those
highly abundant cDNA probes. We found that, to some extent, the mean
capture depth of each target does appear to relate to the abundance of
its probe (Fig. 5b). However, the relationship between the capture depth
of one target and the abundance of its probe is not absolute because
there are also many ORF and UTR sequences with high capture depth
(> 1000) while their corresponding probes are not abundant
(< 100) (Fig. 5b). These
results indicated that unnormalized cDNA probes do not significantly
affect the ability of our method to sequence capture those target
sequences corresponding to low-expressed transcripts.
Detailed
statistics of the sequencing, contig assembly, gene recovery, nucleotide
recovery, and capture specificity for each sample are summarized in
Table 2.
On
the whole, the FLc-Capture method can simultaneously obtain thousands of
coding and noncoding sequences from both the ingroup and outgroup
samples, which indicates that our experimental design using homemade
full-length cDNA probes to capture both of genomic coding and noncoding
regions is successful.