The ORF and UTR data sets and the Colubridae phylogeny
From the FLc-Capture sequencing data, we extracted a total of 1,075 ORFs and 1,948 UTRs that passed our filtering criteria (mean sequencing depth > 5 and at least seventeen taxa present) and could be used for phylogenetic analysis. A summary of data characteristics for the ORFs and UTRs, including length, taxon occupancy, GC content (the average GC content at the third codon position for each ORF and the average overall GC content for each UTR), and percentage of missing data, is given in Appendix S4. The ORF alignments range from 168 to 9,063 bp in length (average = 760 bp) and the UTR alignments range from 107 to 3,011 bp (average = 572 bp) (Fig. 6a). In general, the UTRs have lower mean GC content and lower GC content variation (among genes and among species) than the ORFs (Fig. 6b). The UTR alignments have higher pairwise distances than the ORF alignments, consistent with the expectation that noncoding sequences evolve more rapidly than coding sequences (Fig. 6c). Multidimensional scaling plots of the RF distances among gene trees (Fig. 6d) indicated that the ORF gene trees were more similar to each other than the UTR gene trees were, but the phylogenetic signals among ORFs and among UTRs are, overall, rather congruent.
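The filtering rule and the two GC statistics described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names (`passes_filter`, `gc3_content`, `gc_content`) are hypothetical, each ORF is assumed to be in frame and to start at the first codon position, and gaps or ambiguous bases are simply ignored when computing GC content.

```python
def passes_filter(mean_depth: float, n_taxa: int,
                  min_depth: float = 5.0, min_taxa: int = 17) -> bool:
    """Keep a locus if mean sequencing depth > 5 and it contains >= 17 taxa."""
    return mean_depth > min_depth and n_taxa >= min_taxa


def gc_content(seq: str) -> float:
    """Overall GC fraction, counting only unambiguous bases (A, C, G, T)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return float("nan")  # no usable bases in this sequence
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_content(orf: str) -> float:
    """GC fraction at third codon positions.

    Assumes the sequence is in frame and starts at codon position 1,
    so third positions are at indices 2, 5, 8, ...
    """
    return gc_content(orf[2::3])


# Toy example: codons ATG, GCC, AAA have third bases G, C, A.
example_gc3 = gc3_content("ATGGCCAAA")
```

For a codon-aligned ORF alignment, the same slicing keeps the reading frame because gaps are inserted in multiples of three; for unaligned or frame-shifted sequences the frame would have to be determined first.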
The concatenated supermatrix of ORFs is 817,164 bp in length and 72.8% complete by characters, while the concatenated supermatrix of UTRs is 1,114,278 bp in length and 78.2% complete by characters. The ML trees inferred from the ORF and UTR data sets are identical and well resolved, with at least 85% of nodes having > 95% bootstrap (BS) support (Fig. 7). The backbone phylogeny among the snake families sampled in this study is congruent with that reported by many previous studies (e.g., Pyron et al., 2014; Zheng & Wiens, 2015). Within the family Colubridae, we recognized three major clades: (A) Dipsadinae + Pseudoxenodontinae, (B) Natricinae, and (C) Sibynophiinae + (Calamariinae + (Ahaetuliinae + Colubrinae)). These three clades were repeatedly recovered in previous studies, but the relationships among them were not well supported and differed among those studies (Burbrink et al., 2020; Li et al., 2020; Pyron et al., 2014; Wiens et al., 2012). For example, Wiens et al. (2012) used 44 nuclear genes but did not resolve the relationships among these three clades. Both Pyron et al. (2014) and Burbrink et al. (2020) used hundreds of AHE loci and resolved the relationship as (C,(A,B)), the former with weak support (BS = 65%) and the latter with a posterior probability of 0.88. Li et al. (2020) used 96 mitochondrial and nuclear genes but found the relationship to be (A,(B,C)), again with weak support (BS = 46%). In contrast to these previous results, both our ORF and UTR data sets favored a relationship of (B,(A,C)), and this result is strongly supported by the ORF data set (BS = 100%; Fig. 7).
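As an illustration of what "complete by characters" measures, the following Python sketch concatenates per-locus alignments into a supermatrix and reports the fraction of cells that contain data. This is an assumed definition, not necessarily the one used to obtain the figures above: a cell is counted as missing if it is a gap (`-`), `?`, or `N`, taxa absent from a locus are padded with gaps, and the function names (`concatenate`, `completeness`) are hypothetical.

```python
from typing import Dict, List


def concatenate(alignments: List[Dict[str, str]],
                taxa: List[str]) -> Dict[str, str]:
    """Concatenate per-locus alignments (taxon -> aligned sequence).

    Taxa missing from a locus are padded with gaps of that locus's length.
    Each locus is assumed to be non-empty and internally equal in length.
    """
    parts: Dict[str, List[str]] = {t: [] for t in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for t in taxa:
            parts[t].append(aln.get(t, "-" * length))
    return {t: "".join(p) for t, p in parts.items()}


def completeness(matrix: Dict[str, str], missing: str = "-?Nn") -> float:
    """Fraction of supermatrix cells (taxa x sites) that hold data."""
    total = sum(len(seq) for seq in matrix.values())
    filled = sum(1 for seq in matrix.values() for c in seq if c not in missing)
    return filled / total


# Toy example: two loci, taxon "b" is absent from the second locus.
loci = [{"a": "ACGT", "b": "AC--"}, {"a": "GG"}]
supermatrix = concatenate(loci, ["a", "b"])
fraction_complete = completeness(supermatrix)
```

In this toy case the supermatrix has 2 taxa and 6 sites (12 cells), of which 8 contain data, so the matrix is about 67% complete by characters.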
Our phylogenomic analysis provides, for the first time, a highly resolved phylogeny for colubrid snakes based on extensive sampling of both genes and species. Thanks to the characteristics of the FLc-Capture method, we were able to collect genome-scale coding and noncoding data simultaneously to study the phylogeny of Colubridae. Although some nodes of the resulting phylogeny were not conclusively supported in all analyses, they received high support (ML bootstrap > 85%) from at least one type of data set, which shows the benefit of using coding and noncoding data sets together when addressing difficult phylogenetic questions.