Abstract
Resolving intractable phylogenetic relationships often requires simultaneously
analyzing a large number of coding and noncoding orthologous loci. To
gather both coding and noncoding data, traditional sequence capture
methods require custom-designed commercial probes. Here, we develop a
cost-effective sequence capture method, based on homemade probes, that
captures thousands of coding and noncoding orthologous loci
simultaneously and is suitable for all organisms. This approach, called
“FLc-Capture”, synthesizes biotinylated full-length cDNAs from mRNA as
capture probes, eliminating the need for costly commercial probe design
and synthesis. To demonstrate the
utility of FLc-Capture, we prepared full-length cDNA probes from mRNA
extracted from a common colubrid snake. We performed capture experiments
with these homemade cDNA probes and successfully obtained thousands of
coding and noncoding genomic loci from 24 Colubridae species and 12
distantly related snake species of other families. The average capture
specificity of FLc-Capture across all tested snake species was 35%,
similar to that of the previously published EecSeq method. We constructed two
phylogenomic data sets, one including 1,075 coding loci
(~817,000 bp) and another including 1,948 noncoding loci
(~1,114,000 bp), to study the phylogeny of Colubridae.
Both data sets yielded highly similar and well-resolved trees, with 85%
of nodes having > 95% bootstrap support. Our experimental
tests indicated that FLc-Capture is a flexible, fast, and cost-effective
sequence capture approach for simultaneously gathering coding and
noncoding phylogenomic data sets to study intractable phylogenetic
questions. We anticipate that this method can provide a new data
collection tool for evolutionary biologists working in the era of
high-throughput sequencing.
Keywords: high-throughput sequencing, sequence capture, transcriptome,
snake, phylogeny
Introduction
Resolving difficult phylogenetic questions usually requires genome-scale
data. However, large data sets do not necessarily lead to correct
results, because accurate phylogenetic inference relies on a correct
evolutionary model. Even a subtle model violation may be sufficient to
mislead phylogenetic inference when the data set is large (Hahn &
Nakhleh, 2016).
Therefore, when addressing difficult phylogenetic questions, to avoid
highly supported but wrong phylogenetic inference, in addition to
careful model selection and data refinement, it is often desirable to
analyze several independent phylogenomic data sets for consistency.
In genomes, coding sequences and
noncoding sequences have different evolutionary characteristics and are
relatively independent data sources. As a result, it is becoming
increasingly popular to simultaneously analyze both coding and noncoding
genomic data in many recent phylogenomic studies (Chen et al., 2017;
Jarvis et al., 2014; Reddy et al., 2017).
Whole-genome shotgun (WGS) sequencing is the simplest way to obtain
coding and noncoding phylogenomic data simultaneously, but it is still
cost-prohibitive to sequence dozens or hundreds of full genomes despite
the rapid progress of sequencing technology. In fact, because
phylogenomic studies do not need fully assembled genomes but only
phylogenetically informative loci, low-coverage WGS sequencing is
generally sufficient to meet the basic requirements for phylogenomic
studies. To date, there are three
main approaches for extracting phylogenetically informative loci from
low-coverage WGS data. The first approach, the “automated Target
Restricted Assembly Method” (aTRAM), assembles predefined target
regions from WGS data by selecting reads with iterative BLAST searches
(Allen et al., 2017). This method has been shown to extract over a
thousand loci from 5-10× coverage WGS data of sucking lice (genome size
100-150 Mbp). However, this method is better suited to species with
small genomes, because iterative BLAST searches become too
computationally intensive with large data sets. The second approach
directly extracts phylogenomic data (coding and noncoding) from
low-coverage WGS data by assembling entire genomes (Allio et al., 2019;
Hughes & Teeling, 2018; Zhang et al., 2019). Zhang et al. (2019) showed
that, for species with small genomes (0.1-1 Gb), 10-20× coverage WGS
data is sufficient to extract hundreds to thousands of phylogenetic
loci. However, this method is still not suitable for organisms with
large genomes (> 1 Gb) because de novo genome assembly is very difficult
in such cases. The third approach does not extract phylogenomic loci by
assembling genomes but instead extracts single nucleotide polymorphisms
(SNPs) from low-coverage WGS data by mapping reads to reference genomes.
Olofsson et al. (2018) used this strategy to study the phylogeny of
olives, which have relatively large genomes (~1.5 Gb). The shortcoming
of this method is that it requires annotated reference genomes and tends
to perform relatively poorly across highly divergent lineages.
At present, although low-coverage WGS sequencing has shown great promise
in constructing phylogenomic data sets, it remains challenging to apply
to organisms with large genomes.
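As a rough illustration of why genome size matters for shotgun approaches, the raw sequencing effort scales as genome size multiplied by target coverage. The sketch below is only a back-of-the-envelope calculation: the function names and the 150-bp paired-end read length are illustrative assumptions, not part of any published pipeline, and the genome sizes are simply taken from the ranges quoted above.

```python
def required_bases(genome_size_bp: int, coverage: int) -> int:
    """Total bases to sequence to reach a given mean coverage."""
    return genome_size_bp * coverage


def required_read_pairs(genome_size_bp: int, coverage: int,
                        read_length_bp: int = 150) -> int:
    """Paired-end read pairs needed, rounded up.

    The default read length (150 bp) is an assumption for illustration.
    """
    bases_per_pair = 2 * read_length_bp
    # Integer ceiling division avoids floating-point rounding.
    return -(-required_bases(genome_size_bp, coverage) // bases_per_pair)


# Illustrative genome sizes drawn from the ranges mentioned in the text.
examples = {
    "small genome (150 Mbp)": 150_000_000,
    "1 Gb genome": 1_000_000_000,
    "olive-sized genome (1.5 Gb)": 1_500_000_000,
}

for label, size in examples.items():
    gigabases = required_bases(size, 10) / 1e9
    pairs = required_read_pairs(size, 10)
    print(f"{label}: {gigabases:.1f} Gb of reads, {pairs:,} read pairs at 10x")
```

At equal coverage the data requirement grows linearly with genome size, so a tenfold larger genome needs roughly ten times the sequencing effort per sample.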
Two sequencing methods perform better than genome shotgun sequencing in
generating phylogenomic data from species with large genomes:
transcriptome sequencing (Morozova et al., 2009; Wang et al., 2009) and
sequence capture (Faircloth et al., 2012; Glenn et al., 2016; Jones et
al., 2016; Lemmon et al., 2012; Lemmon & Lemmon, 2013). Transcriptome
sequencing targets expressed mRNAs, whose size does not vary
substantially, no matter how large the genome is.
Because mRNAs contain both open reading frames (ORFs) and untranslated
regions (3’ UTR and 5’ UTR), transcriptome sequencing can enable
researchers to obtain a large amount of coding and noncoding sequences
simultaneously (Garrison et al., 2016; Misof et al., 2014; Oakley et
al., 2012). However, transcriptome sequencing requires fresh or properly
stored tissues to provide high-quality RNA, which often limits the
number of taxa included in such phylogenomic studies (Lemmon & Lemmon,
2013; McCormack et al., 2013).
Sequence capture uses biotinylated probes to selectively enrich the
target regions of the genome of interest. It allows researchers to
attain higher sequencing depth over a predefined subset of the genome
for a given cost, which is particularly helpful for species with large
genomes (McCartney-Melstad et al., 2016).
An advantage of sequence capture is that it does not require
high-quality DNA samples and can handle highly degraded DNA extracted
from old museum specimens (e.g., Blaimer et al., 2016; Guschanski et
al., 2013).
This property can greatly increase the number of taxa sampled in a
phylogenomic study.
Moreover, sequence capture is also very flexible. Many capture methods
have been developed for various purposes, such as ultra-conserved
element (UCE) sequencing (Faircloth et al., 2012) for collecting
noncoding sequences, anchored hybrid enrichment (AHE; Lemmon et al.,
2012) and exon capture (Albert et al., 2007; Bi et al., 2012; Ng et al.,
2009) for collecting coding sequences, and a combination of AHE and UCE
for collecting both coding and noncoding sequences simultaneously
(Singhal et al., 2017).
However, most current sequence capture methods require the researcher to
have prior genomic information for probe design and then to have the
probes synthesized by commercial companies. For nonmodel species, probe
design is often difficult because genomic information is lacking. In
addition, the cost of commercial probes will be high when a research
project has hundreds of samples or more, potentially reaching several
thousand dollars.
Recently, Puritz and Lotterhos (2017) demonstrated that cDNA fragments
could be used as capture probes to capture coding sequences from
genomes. Using cDNAs reverse-transcribed from mRNAs as capture probes
avoids the need for commercial probes and thus greatly reduces the cost
of experiments. However, the method of Puritz and Lotterhos (EecSeq)
focuses only on capturing coding regions, and its experimental design
and bioinformatic pipeline revolve around obtaining exonic SNPs.
In fact, full-length cDNA sequences consist of coding ORFs and noncoding
UTRs. If both ORFs and UTRs are considered in the cDNA probe
preparation, genomic DNA of both coding and noncoding regions can be
captured and sequenced simultaneously. The direct use of full-length
cDNAs as probes for sequence capture can produce transcriptome-level
data and skips the step of probe design, which makes it particularly
suitable for nonmodel organisms lacking genomic information. Moreover,
it allows investigators to obtain coding and noncoding phylogenomic data
simultaneously, and thus will be helpful for studying difficult
phylogenetic questions.
In this study, we present a novel sequence capture method based on
homemade probes, called “full-length cDNA capture sequencing”
(FLc-Capture). It is a universal, flexible, and cost-effective sequence
capture method that works for all organism groups. The most distinctive
feature of this method is the use of SMART technology (Clontech Inc.) to
synthesize full-length cDNAs, from which biotinylated probes are then
created. The specially designed bioinformatics analysis scheme enables
users to extract a large number of genomic loci (both coding and
noncoding) from the capture data without any prior genomic knowledge of
the taxa being investigated. To demonstrate the utility of the
FLc-Capture method, we
used it to study the phylogeny of the family Colubridae (Serpentes:
Caenophidia), a rapidly radiating lineage with large genomes (~2 Gb). We
successfully obtained hundreds to thousands of coding and noncoding
genomic loci from dozens of colubrid and distantly related outgroup
snake species from the FLc-Capture data. These coding and noncoding
phylogenomic data allowed us to reconstruct a robust phylogeny of
Colubridae and to address the long-debated relationships among
subfamilies. We anticipate that the method presented in
this study can provide a new high-throughput sequencing approach for
studies seeking to resolve difficult phylogenetic questions.