Materials and Methods
Whole genome sequencing
We generated a whole genome assembly for a male Phyllostomus
hastatus , PE091, collected in Jenaro Herrera, Peru. Field-collected
tissues from Phyllostomus hastatus specimen PE091 were lawfully
collected under permit #0122–2015–SERFOR–DGGSPFFS, exported under
SERFOR permit #0002287, and imported under USFW 3-177 2015MI1694291.
Samples were preserved in RNAlater for one week before flash–freezing
in a liquid nitrogen dry shipper, following previously published
protocols (Yohe et al., 2019). High molecular weight genomic DNA was
extracted from flash-frozen liver using the Qiamp DNA Micro Kit
(Germantown, MD, USA) and sequenced on a PromethION instrument (Oxford
Nanopore Technologies, New York, NY, USA) at Cold Spring Harbor
Laboratory. Additionally, short-read Illumina whole genome sequencing
was performed at Novogene, Inc (California, USA). Genomic DNA from lung
was randomly fragmented to 350bp, end-repaired, adenylated, ligated with
Illumina sequencing adapters, and further PCR–enriched. The final
libraries were purified (AMPure XP system) and library quality and size
verification were assessed on an Agilent 2100 Bioanalyzer (Agilent
Technologies, CA, USA). Molar concentration was assessed using real-time
PCR.
De novo genome assembly was performed using Flye v.2.7.1
(Kolmogorov, Yuan, Lin, & Pevzner, 2019) using default–nano-raw parameterization. The obtained pre-assembly was
polished using Illumina short-reads with POLCA tool built-in MaSuRCA
genome assembly and analysis toolkit (Zimin et al., 2013).
Genome database construction
Publicly-available genome assemblies for an additional 36 bat species
(Supplementary Table 1) were downloaded from open-source databases to
maximize bat taxonomic sampling (D. Dong et al., 2017; Eckalbar et al.,
2016; Gutiérrez-Guerrero et al., 2020; Jebb et al., 2020; Parker et al.,
2013; Seim et al., 2013; K. Wang et al., 2020; Zepeda Mendoza et al.,
2018; G. Zhang et al., 2013). Assemblies were masked with RepeatMasker
v.4.1.0 (Smit, Hubley, & Green, n.d.) using a custom library combining
known mammalian transposable elements (TE) from Repbase (v20181026), ade novo mammalian TE library generated using assemblies from the
Zoonomia Project (Genereux et al., 2020) and the Dfam database, and a
custom bat–specific TE library generated by manual curation (Jebb et
al., 2020).
All assemblies were annotated or re-annotated with the MAKER annotation
pipeline v.2.31.10 (Holt & Yandell, 2011) to avoid bias in downstream
analyses caused by differences in genome assembly annotation quality.
Two iterations of MAKER were performed for each species. During the
first run we provided expressed sequence tags (ESTs) and transcriptomic
data as inputs (Davies et al., 2020; Potter et al., n.d.) (Supplementary
Table 2). If species-specific transcriptomic data were unavailable, we
used information from a related species of the same genus. We used two
databases for protein homology the Uniprot/Swiss-Prot protein sequence
database (Bateman, 2019) and a bat–specific protein database obtained
from high-quality genome annotations for six bat species (Jebb et al.,
2020). Repeat evidence was provided using the repeat annotation GFF3
file generated by RepeatMasker. Gene models generated on the first run
were used for gene predictions with two gene software packages, SNAP
(Korf, 2004) and Augustus (Stanke & Waack, 2003). Only gene models with
an AED score < 0.25 and with more than 50 amino acids were
retained. For the second run, focusing on re-annotation, the MAKER
control file was edited to include the GFF3 output file from the first
run gene predictions generated by SNAP and the Augustus gene prediction
species model as inputs. Functional annotation was performed with BlastP
(Camacho et al., 2009) using the Uniprot/Swiss-Prot database and protein
domain annotation with InterProScan (Jones et al., 2014).
Homology inference
Protein homology was inferred among the proteins of 43 mammals:Homo sapiens, Mus musculus, Sus scrofa, Bos taurus, Equus
caballus, Canis lupus familiaris, and the 37 bat species (Supplementary
Table 1). Orthologous groups (orthogroups) were assigned with
Orthofinder v.2.4.0 (Emms & Kelly, 2019). When no orthologs were
inferred for the Chiroptera in a given orthogroup, we independently
analyzed the genome data to confirm gene losses in bats. To this end, we
performed a BLAST search against the 37 bat genomes using the following
criteria: an e-value of 1e-6 and an identity and protein coverage
greater than 80%. Then, genomic regions with a BLAST hit were extracted
along with 200bp upstream and downstream. Sequences were aligned with
the MAFFT aligner tool v.7.402 (Katoh & Standley, 2013) and visualized
using Geneious version 11.1.3 (Kearse et al., 2012) to discriminate
annotation errors. Additionally, BLAST searches were also performed
against transcriptomic data from 22 bat species (Supplementary Table 2)
(Potter et al., n.d.). For these searches, potential matches were
filtered more strictly, and those with identity and protein coverage ≥
90% were retained. Subsequent blast hit extraction, alignment and
visualization were as for the genome searches.
Enrichment in chiropteran gene losses
We conducted pathway enrichment analyses with the final list of genes
missing from all bat species using two databases: BioPlanet (R. Huang et
al., 2019) and DICE GOnet (Pomaznoy, Ha, & Peters, 2018). In each case,
we used the list of gene symbols as input with a cutoff value of 0.05
(BioPlanet) and a similar p–value in the DICE GOnet biological process
classification for the mouse model. In both cases, all genes found to be
missing were used as input and compared to a reference set of genes
annotated in the corresponding database.
Gene family evolution
While previous analyses that included bat species have analyzed signals
of positive selection across bats (e.g. Parker et al., 2013), fewer have
explicitly centered on gene family evolution (Jebb et al., 2020;
Tsagkogeorga et al., 2017). To analyze our comprehensive bat-focused
sample, we modeled gene family expansions and contractions using CAFE
(Computational Analysis of Gene Family Evolution) v.4.2.1 (M. V. Han,
Thomas, Lugo-Martinez, & Hahn, 2013). The effect of genome quality on
downstream gene predictions is well documented and leads to an
overestimation of gene gains and losses (Denton et al., 2014;
Tsagkogeorga et al., 2017). To mitigate the aforementioned bias, only
genome assemblies with BUSCO completeness scores over 80%, totaling 34
species (28 bat species and 6 outgroup mammals) were used for CAFE. This
smaller subset of protein sequences was filtered, retaining only the
longest isoform. Homology clustering was performed with Orthofinder
v.2.4.0 (Emms & Kelly, 2019).
CAFE fits a birth and death parameter (λ) to estimate the probability of
gene gains or losses across a specified phylogeny (Hahn, De Bie,
Stajich, Nguyen, & Cristianini, 2005). We first inferred an ultrametric
phylogenomic tree based on 350 single copy orthologous genes (207,551
amino acid sites). All the orthologs were concatenated into a single
207,551–amino acid “contig” and sequence alignment was performed
using the MAFFT aligner tool v.7.402 (Katoh & Standley, 2013). We
evaluated the best-fit models of protein evolution with ProtTest v.3
(Darriba, Taboada, Doallo, & Posada, 2011) using two criteria: the
Akaike Information Criterion (AIC) and the Bayesian Information
Criterion (BIC) (distribution JTT, +G +I +I +G and 80% consensus
threshold). A maximum likelihood tree was inferred for the concatenated
data set with RAxML v.8 (Stamatakis, 2014). Estimation of species
divergence times was performed with Bayesian phylogenetic methods using
the MCMCtree tool in the PAML v.4.9 package (Yang, 2007). We calibrated
divergence dates using six points based on fossil records: 1)Icaronycteris , considered as one of the oldest echolocating
fossil bats, dated at 52 Mya (Gunnell & Simmons, 2005; Simmons,
Seymour, Habersetzer, & Gunnell, 2008); Tachypteron, the oldest
known emballonurid fossil from the early Middle Eocene, with an age
range of 48.6 to 40 Mya (Storch, Sigé, & Habersetzer, 2002);Hipposideros africanum , the oldest fossil record of the family
Hipposideridae, its records date at 41.3 Mya (Ravel et al., 2016);
Vespertillionidae indet. (41.3 Mya) (Eiting & Gunnell, 2009);
Phyllostomidae indet. (30 Mya) (Nicholas J Czaplewski, 2010), andPalynephyllum (11.8 Mya) (Nicolas J Czaplewski, Takai, Naeher, &
Setoguchi, 2003; Dávalos, Velazco, Warsi, Smits, & Simmons, 2014).
Additionally, we included and corroborated the molecular dates for the
base of the ingroup root estimated by Teeling et al. (2005).
We filtered the final input for CAFE to reduce systematic bias in
inferring gene family evolution. First, we retained only gene families
present at the most recent common ancestor of the phylogeny, with at
least one gene present in each of the four clades assigned: a)
Euarchontoglires (Homo sapiens and Mus musculus ), b)
non-Chiroptera Laurasiatheria (Bos taurus, Canis familiaris, Equus
caballus, Sus scrofa ), c) Yangochiroptera and d) Yinpterochiroptera.
Second, gene families missing in more than 50% of bat species were
excluded. Finally, families with large gene copy number variance (≥100
gene copies) were excluded for the global birth and death (λ) rate
inference.
To analyze families with at least one gene copy across the taxa sampled,
we first estimated a global λ for all branches. The global model was
compared against a three multi-λ model that fits each lineage with its
own gene family evolution rate. To test which model fits better with our
dataset, we performed a likelihood ratio test for 100 gene family
evolution simulations. We ran CAFE in error correction mode to account
for genome assembly and annotation errors and estimate the global
distribution of error with the assumption that all branches share a
unique λ rate (λ=0.0033734) as described in Han et al. (2013). Finally,
we used complementary tools; the Protein Analysis Through Evolutionary
Relationships (PANTHER v.15) (Mi, Muruganujan, Ebert, Huang, & Thomas,
2019) and Gene Ontology Analysis (GOnet) to annotate genes with gene
ontology (GO) terms (Ashburner et al., 2000; Carbon et al., 2019) and
assign them to gene families, pathways, and biological process
categories.
Selection tests
We identified genes under positive selection by evaluating 268
single–copy genes involved in immune response, based on a curated
database of 1,793 genes downloaded from the IMMPORTDB repository
(Bhattacharya et al., 2014) available at
https://www.immport.org/home. Gene alignments were built with
MAFFT v.7.402 (Katoh & Standley, 2013) and manually filtered to remove
sequences with less than 70% of protein coverage based on the
homologous human protein. Only alignments represented by at least 30%
of the species were used for downstream analysis. For each gene in the
codeml analyses, we built a phylogeny with RAxML (Stamatakis, 2014) and
a codon alignment for each gene with PAL2NAL (Suyama, Torrents, & Bork,
2006).
We tested for evidence of positive selection among sites along bat
lineages using the strict branch–site model (Yang, Wong, & Nielsen,
2005; J. Zhang, Nielsen, & Yang, 2005) with maximum-likelihood
estimations implemented in codeml in PAML v.4.9 (Yang, 2007). We
implemented model 2 as this allows the dN/dS ratio (ω) to vary across
branches and sites and to detect if selection differs in a few amino
acid residues in specific lineages (foreground branches). We compared
two hypotheses, assigning the 37 bat species as foreground branches: 1)
the null hypothesis with a fixed ω (ω=1) for all branches does not allow
for positive selection, and 2) an alternative hypothesis assuming that
the foreground branches have a greater proportion of sites under
positive selection (ω > 1) than the background branches.
The null hypothesis was tested against the alternative model with the
likelihood-ratio test (LRT); the p-value was calculated under a
chi-square distribution with 1 degree of freedom, additionally we
adjusted the p-value using the false discovery rate (FDR) correction. To
detect sites under positive selection, we used the Bayes Empirical Bayes
(BEB) (Yang et al., 2005) approach to calculate posterior probabilities
that a site has a significant value of ω >1. The residues
with a high posterior probability (P > 95%) were
considered.
To determine how robust the signals of positive selection detected were,
we used the adaptive Branch-Site Random Effects Likelihood (aBSREL)
(Smith et al., 2015) model, as implemented in HyPhy (Kosakovsky Pond,
Frost, & Muse, 2005). The aBSREL model explores whether a proportion of
sites have evolved under positive selection in each branch of the
phylogeny, and was applied to all alignments using their respective gene
trees. The false discovery rate method of multiple testing correction
was applied to all p-values generated for each branch and gene.