The ORF and UTR data sets and the Colubridae phylogeny
From the FLc-Capture sequencing data, we extracted a total of 1,075 ORFs
and 1,948 UTRs that passed our filtering criteria (mean sequencing
depth > 5 and sequences present for at least 17 taxa) and could
therefore be used for phylogenetic analysis.
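The two filtering criteria can be sketched as a simple predicate over per-locus summary statistics. This is a minimal illustration, not the pipeline used in the study: the `Locus` record, its field names, and the example values are hypothetical, and only the two thresholds (depth > 5, at least 17 taxa) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    """Hypothetical per-locus summary; field names are illustrative."""
    name: str
    mean_depth: float  # mean sequencing depth across the locus
    n_taxa: int        # number of taxa with a sequence for this locus


MIN_DEPTH = 5   # exclusive threshold: mean depth must be greater than 5
MIN_TAXA = 17   # inclusive threshold: at least 17 taxa


def passes_filter(locus: Locus) -> bool:
    """Return True if the locus meets both criteria stated in the text."""
    return locus.mean_depth > MIN_DEPTH and locus.n_taxa >= MIN_TAXA


def filter_loci(loci):
    """Keep only the loci that pass both criteria, preserving input order."""
    return [locus for locus in loci if passes_filter(locus)]


# Made-up example values, for illustration only.
example = [
    Locus("orf_a", 12.3, 20),  # passes both criteria
    Locus("orf_b", 4.9, 22),   # fails on depth
    Locus("utr_c", 8.0, 16),   # fails on taxon count
    Locus("utr_d", 5.1, 17),   # passes both criteria (boundary case)
]
kept = [locus.name for locus in filter_loci(example)]
```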
A summary of data characteristics for the ORFs and UTRs, including length,
taxon occupancy, GC content (the average GC content at the third codon
position for each ORF and the average overall GC content for each UTR),
and percentage of missing data, is given in Appendix S4. The lengths of ORF
alignments range from 168 to 9,063 bp (average = 760 bp) and the lengths
of UTR alignments range from 107 to 3,011 bp (average = 572 bp) (Fig.
6a). In general, the UTRs have lower
mean GC content and lower GC content variation (among genes and among
species) than the ORFs (Fig. 6b). The UTR alignments have a higher
pairwise distance than the ORF alignments, consistent with the
expectation that noncoding sequences evolve more rapidly than coding
sequences (Fig. 6c).
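As an illustration of the summary statistics compared above, the following sketch computes overall GC content, third-codon-position GC content, and an uncorrected pairwise distance. Several points are assumptions rather than details from the study: the ORF is taken to start in frame at its first column, gaps and ambiguous bases are ignored, and the uncorrected p-distance is used only as a stand-in because the text does not state which distance measure was applied. All function names are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G/C among unambiguous bases; gaps and N are ignored."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return float("nan")
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_fraction(orf_seq: str) -> float:
    """GC content at third codon positions.

    Assumes the sequence starts in frame, so third positions are
    indices 2, 5, 8, ... (for aligned data, a codon-aware alignment).
    """
    return gc_fraction(orf_seq[2::3])


def p_distance(seq_a: str, seq_b: str) -> float:
    """Uncorrected proportion of differing sites between two aligned sequences.

    Only columns where both sequences have an unambiguous base are compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            differing += a != b
    return differing / compared if compared else float("nan")
```

Averaging `gc3_fraction` over the taxa in an ORF alignment, and `gc_fraction` over the taxa in a UTR alignment, would give per-locus values of the kind summarized in Fig. 6b.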
Multidimensional scaling plots of the Robinson-Foulds (RF) distances
among gene trees (Fig. 6d) indicated that the ORF gene trees were more
similar to one another than the UTR gene trees were, but the
phylogenetic signals among ORFs and among UTRs were, overall, fairly
congruent.
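For readers unfamiliar with the RF distance, the sketch below computes it for two unrooted trees as the number of nontrivial bipartitions found in one tree but not the other. It is a minimal illustration under stated assumptions: trees are written as nested Python tuples of taxon names, both trees contain exactly the same taxa (gene trees with missing taxa would first need pruning to a shared taxon set), and the function names are hypothetical. It is not the software used in the study.

```python
def clades(tree):
    """Return (leaf set, list of leaf sets for every internal node)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found.extend(child_found)
    found.append(leaves)
    return leaves, found


def splits(tree):
    """Nontrivial bipartitions of a tree, treated as unrooted.

    Each split is stored as the side that does NOT contain an anchor
    taxon, so the same split gets the same key however the tree is rooted.
    """
    all_leaves, found = clades(tree)
    anchor = min(all_leaves)
    result = set()
    for clade in found:
        side = all_leaves - clade if anchor in clade else clade
        # Keep a split only if both sides hold at least two taxa.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Symmetric-difference (Robinson-Foulds) distance between two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Toy four-taxon example with made-up taxon names.
t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = ("A", ("B", ("C", "D")))  # same unrooted topology as t1
```

A matrix of such pairwise distances among all gene trees is the kind of input that a multidimensional scaling analysis takes.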
The concatenated supermatrix of ORFs is 817,164 bp in length and 72.8%
complete by characters, while the concatenated supermatrix of UTRs is
1,114,278 bp in length and 78.2% complete by characters.
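Completeness "by characters" can be read as the fraction of cells in the taxon-by-site matrix that hold an actual base. The sketch below concatenates per-locus alignments and computes that fraction. It is an illustration under assumptions: alignments are plain dictionaries mapping taxon name to aligned sequence, taxa absent from a locus are padded with `?`, and `-`, `?`, and `N` are all counted as missing. The study may have treated these cases differently, and the function names are hypothetical.

```python
MISSING = set("-?N")  # assumption: gaps, unknowns and N all count as missing


def concatenate(alignments, taxa):
    """Join per-locus alignments into one supermatrix.

    Taxa absent from a locus are padded with '?' for that locus.
    """
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        length = len(next(iter(aln.values())))
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "?" * length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def completeness(matrix):
    """Fraction of matrix cells that are not missing data."""
    total = sum(len(seq) for seq in matrix.values())
    present = sum(
        ch.upper() not in MISSING for seq in matrix.values() for ch in seq
    )
    return present / total


# Toy example with made-up sequences.
taxa = ["sp1", "sp2", "sp3"]
locus_1 = {"sp1": "ACGT", "sp2": "AC-T"}           # sp3 is absent
locus_2 = {"sp1": "GG", "sp2": "GA", "sp3": "GG"}
supermatrix = concatenate([locus_1, locus_2], taxa)
fraction_complete = completeness(supermatrix)
```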
The ML trees inferred from the ORF
and UTR data sets are identical and well resolved, with at least 85% of
nodes having > 95% bootstrap (BS) support (Fig. 7).
The backbone phylogeny among the snake families sampled in this study is
congruent with that reported by many previous studies (e.g., Pyron et
al., 2014; Zheng & Wiens, 2015). Within the family Colubridae, we
recognized three major clades: (A) Dipsadinae + Pseudoxenodontinae, (B)
Natricinae, and (C) Sibynophiinae + (Calamariinae + (Ahaetuliinae +
Colubrinae)). These three clades were repeatedly recovered in previous
studies, but the relationships among them were not well supported and
differed among those studies (Burbrink et al., 2020; Li et al., 2020;
Pyron et al., 2014; Wiens et al., 2012).
For example, Wiens et al. (2012) used 44 nuclear genes but did not resolve
the relationships among these three clades. Both Pyron et al. (2014) and
Burbrink et al. (2020) used hundreds of AHE loci and resolved the
relationship as (C,(A,B)) (the former with weak support, BS = 65%; the
latter with posterior probability = 0.88). Li et al. (2020) used 96
mitochondrial and nuclear genes but found the relationship to be
(A,(B,C)) (weakly supported; BS = 46%).
In contrast to these previous results, both our ORF and UTR data sets
favored a relationship of (B,(A,C)), and this result is strongly
supported by the ORF data set (BS = 100%; Fig. 7).
Our phylogenomic analysis provides, for the first time, a highly resolved
phylogeny for colubrid snakes based on extensive sampling of both genes
and species. Thanks to the characteristics of the FLc-Capture method, we
were able to simultaneously collect genome-scale coding and noncoding
data to study the phylogeny of Colubridae. Although some nodes of our
resulting phylogeny were not conclusively supported in all analyses,
they received high support (ML bootstrap > 85%) from at least one type
of data set, which shows the benefit of using coding and noncoding data
sets together when studying difficult phylogenetic questions.