Abstract
Resolving intractable phylogenetic relationships often requires simultaneously analyzing a large number of coding and noncoding orthologous loci. To gather both coding and noncoding data, traditional sequence capture methods require custom-designed commercial probes. Here, we develop a cost-effective sequence capture method based on homemade probes that captures thousands of coding and noncoding orthologous loci simultaneously and is suitable for all organisms. This approach, called “FLc-Capture”, synthesizes biotinylated full-length cDNAs from mRNA as capture probes, eliminating the need for costly commercial probe design and synthesis. To demonstrate the utility of FLc-Capture, we prepared full-length cDNA probes from mRNA extracted from a common colubrid snake. We performed capture experiments with these homemade cDNA probes and successfully obtained thousands of coding and noncoding genomic loci from 24 Colubridae species and 12 distantly related snake species of other families. The average capture specificity of FLc-Capture across all tested snake species is 35%, similar to that of the previously published EecSeq method. We constructed two phylogenomic data sets, one including 1,075 coding loci (~817,000 bp) and another including 1,948 noncoding loci (~1,114,000 bp), to study the phylogeny of Colubridae. Both data sets yielded highly similar and well-resolved trees, with 85% of nodes having >95% bootstrap support. Our experimental tests indicate that FLc-Capture is a flexible, fast, and cost-effective sequence capture approach for simultaneously gathering coding and noncoding phylogenomic data sets to study intractable phylogenetic questions. We anticipate that this method will provide a new data collection tool for evolutionary biologists working in the era of high-throughput sequencing.
Keywords: high-throughput sequencing, sequence capture, transcriptome, snake, phylogeny
Introduction
Resolving difficult phylogenetic questions usually requires genome-scale data. However, large data sets do not necessarily lead to correct results, because accurate phylogenetic inference relies on a correct evolutionary model. A subtle model violation may be sufficient to mislead phylogenetic inference when the data set is large (Hahn & Nakhleh, 2016). Therefore, when addressing difficult phylogenetic questions, it is often desirable, in addition to careful model selection and data refinement, to analyze several independent phylogenomic data sets for consistency and thereby avoid highly supported but wrong inferences. In genomes, coding and noncoding sequences have different evolutionary characteristics and are relatively independent data sources. As a result, it has become increasingly popular in recent phylogenomic studies to analyze both coding and noncoding genomic data simultaneously (Chen et al., 2017; Jarvis et al., 2014; Reddy et al., 2017).
Whole-genome shotgun (WGS) sequencing is the simplest way to obtain coding and noncoding phylogenomic data simultaneously, but despite the rapid progress of sequencing technology it is still cost-prohibitive to sequence dozens or hundreds of full genomes. In fact, because phylogenomic studies need only phylogenetically informative loci rather than fully assembled genomes, low-coverage WGS sequencing is generally sufficient to meet their basic requirements. To date, three main approaches have been used to extract phylogenetically informative loci from low-coverage WGS data. The first approach, called “automated Target Restricted Assembly Method (aTRAM)”, assembles WGS data into predefined target regions by selecting reads with iterative BLAST searches (Allen et al., 2017). This method has been shown to extract over a thousand loci from 5-10× coverage WGS data of sucking lice (genome size 100-150 Mbp). However, it is better suited to species with small genomes, because iterative BLAST searches become too computationally intensive with large data sets. The second approach extracts phylogenomic data (coding and noncoding) directly from low-coverage WGS data by assembling entire genomes (Allio et al., 2019; Hughes & Teeling, 2018; Zhang et al., 2019). Zhang et al. (2019) showed that, for species with small genomes (0.1-1 G), 10-20× coverage WGS data is sufficient to extract hundreds to thousands of phylogenetic loci. However, this method is still not suitable for organisms with large genomes (> 1 G), because de novo genome assembly is very difficult in this situation. The third approach does not extract phylogenomic loci by assembling genomes; instead, it extracts single nucleotide polymorphisms (SNPs) from low-coverage WGS data by mapping reads to reference genomes. Olofsson et al. (2018) used this strategy to study the phylogeny of olives, which have relatively large genomes (~1.5 G). The shortcoming of this method is that it requires annotated reference genomes and tends to perform relatively poorly across highly divergent lineages. Thus, although low-coverage WGS sequencing has shown great promise for constructing phylogenomic data sets, it remains somewhat challenging to apply to organisms with large genomes.
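To make the scaling problem concrete, the following minimal sketch estimates the raw sequencing output needed per sample as genome size multiplied by mean coverage. It is an illustration only: the function name `required_bases` is hypothetical, "G" is assumed to mean gigabase pairs, and the 10× coverage applied to the olive and snake genomes is an assumed value chosen for comparison rather than a figure reported by the cited studies.

```python
def required_bases(genome_size_bp: float, coverage: float) -> float:
    """Approximate raw bases needed to reach a given mean coverage.

    Simple rule of thumb: bases = genome size x coverage. It ignores
    duplicates, adapter loss, and uneven coverage.
    """
    return genome_size_bp * coverage


# Genome sizes are taken from the text; the coverage values for the
# olive and snake examples are assumptions made for illustration.
examples = {
    "sucking louse (150 Mbp, 10x)": (150e6, 10),
    "olive (~1.5 Gbp, assumed 10x)": (1.5e9, 10),
    "colubrid snake (~2 Gbp, assumed 10x)": (2e9, 10),
}

for label, (size_bp, cov) in examples.items():
    gbp = required_bases(size_bp, cov) / 1e9
    print(f"{label}: about {gbp:.1f} Gbp of raw sequence per sample")
```

Under these assumptions, the per-sample sequencing effort grows in direct proportion to genome size, which is why approaches that work well for a 150 Mbp louse genome become expensive for snake-sized genomes.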
Two sequencing methods perform better than genome shotgun sequencing in generating phylogenomic data for species with large genomes: transcriptome sequencing (Morozova et al., 2009; Wang et al., 2009) and sequence capture (Faircloth et al., 2012; Glenn et al., 2016; Jones et al., 2016; Lemmon et al., 2012; Lemmon & Lemmon, 2013). The target of transcriptome sequencing is expressed mRNAs, whose size does not vary significantly no matter how large the genome is. Because mRNAs contain both open reading frames (ORFs) and untranslated regions (3’ UTR and 5’ UTR), transcriptome sequencing enables researchers to obtain a large amount of coding and noncoding sequence simultaneously (Garrison et al., 2016; Misof et al., 2014; Oakley et al., 2012). However, transcriptome sequencing requires fresh or properly stored tissues to provide high-quality RNA, which often limits the number of taxa included in such phylogenomic studies (Lemmon & Lemmon, 2013; McCormack et al., 2013). Sequence capture uses biotinylated probes to selectively enrich target regions of the genome of interest. It allows researchers to attain higher sequencing depth over a predefined subset of the genome for a given cost, which is particularly helpful for species with large genomes (McCartney-Melstad et al., 2016). An advantage of sequence capture is that it does not require high-quality DNA samples and can handle highly degraded DNA extracted from old museum specimens (e.g., Blaimer et al., 2016; Guschanski et al., 2013). This property can greatly increase the number of taxa sampled in a phylogenomic study. Moreover, sequence capture is very flexible. Many capture methods have been developed for various purposes, such as ultra-conserved element (UCE) sequencing (Faircloth et al., 2012) for collecting noncoding sequences, anchored hybrid enrichment (AHE; Lemmon et al., 2012) and exon capture (Albert et al., 2007; Bi et al., 2012; Ng et al., 2009) for collecting coding sequences, and a combination of AHE and UCE for collecting both coding and noncoding sequences simultaneously (Singhal et al., 2017). However, most current sequence capture methods require the researcher to have prior genomic information for probe design and then to have the probes synthesized by commercial companies. For nonmodel species, probe design is often difficult because genomic information is lacking. In addition, the cost of commercial probes becomes high when a research project has hundreds of samples or more, probably reaching several thousand dollars.
Recently, Puritz and Lotterhos (2017) demonstrated that cDNA fragments can be used as probes to capture coding sequences from genomes. Using cDNAs reverse-transcribed from mRNAs as capture probes avoids the need for commercial probes and thus greatly reduces the cost of experiments. The method of Puritz and Lotterhos (EecSeq) focuses only on capturing coding regions, and its experimental design and bioinformatic pipeline revolve around obtaining exonic SNPs. However, full-length cDNA sequences consist of both coding ORFs and noncoding UTRs. If both ORFs and UTRs are included in the cDNA probe preparation, genomic DNA from both coding and noncoding regions can be captured and sequenced simultaneously. Using full-length cDNAs directly as capture probes can produce transcriptome-level data and skips the probe design step, which makes the approach particularly suitable for nonmodel organisms lacking genomic information. Moreover, it allows investigators to obtain coding and noncoding phylogenomic data simultaneously and will thus be helpful for studying difficult phylogenetic questions.
In this study, we present a novel sequence capture method based on homemade probes, called “full-length cDNA capture sequencing” (FLc-Capture). It is a universal, flexible, and cost-effective sequence capture method that works for all organism groups. The most distinctive feature of this method is that it uses the SMART technology (Clontech Inc.) to synthesize full-length cDNAs and then creates biotinylated probes from these cDNAs. A specially designed bioinformatics analysis scheme enables users to extract a large number of genomic loci (both coding and noncoding) from the capture data without any prior genomic knowledge of the taxa being investigated. To demonstrate the utility of the FLc-Capture method, we used it to study the phylogeny of the family Colubridae (Serpentes: Caenophidia), a rapidly radiating lineage with large genomes (~2 G). From the FLc-Capture data, we successfully obtained hundreds to thousands of coding and noncoding genomic loci from dozens of colubrid species and distantly related outgroup snake species. These coding and noncoding phylogenomic data allowed us to reconstruct a robust phylogeny of Colubridae and to address the long-debated relationships among its subfamilies. We anticipate that the method presented in this study will provide a new high-throughput sequencing approach for studies seeking to resolve difficult phylogenetic questions.