Materials and Methods

Whole genome sequencing

We generated a whole genome assembly for a male Phyllostomus hastatus , PE091, collected in Jenaro Herrera, Peru. Field-collected tissues from Phyllostomus hastatus specimen PE091 were lawfully collected under permit #0122–2015–SERFOR–DGGSPFFS, exported under SERFOR permit #0002287, and imported under USFW 3-177 2015MI1694291.
Samples were preserved in RNAlater for one week before flash–freezing in a liquid nitrogen dry shipper, following previously published protocols (Yohe et al., 2019). High molecular weight genomic DNA was extracted from flash-frozen liver using the Qiamp DNA Micro Kit (Germantown, MD, USA) and sequenced on a PromethION instrument (Oxford Nanopore Technologies, New York, NY, USA) at Cold Spring Harbor Laboratory. Additionally, short-read Illumina whole genome sequencing was performed at Novogene, Inc (California, USA). Genomic DNA from lung was randomly fragmented to 350bp, end-repaired, adenylated, ligated with Illumina sequencing adapters, and further PCR–enriched. The final libraries were purified (AMPure XP system) and library quality and size verification were assessed on an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA). Molar concentration was assessed using real-time PCR.
De novo genome assembly was performed using Flye v.2.7.1 (Kolmogorov, Yuan, Lin, & Pevzner, 2019) using default–nano-raw parameterization. The obtained pre-assembly was polished using Illumina short-reads with POLCA tool built-in MaSuRCA genome assembly and analysis toolkit (Zimin et al., 2013).

Genome database construction

Publicly-available genome assemblies for an additional 36 bat species (Supplementary Table 1) were downloaded from open-source databases to maximize bat taxonomic sampling (D. Dong et al., 2017; Eckalbar et al., 2016; Gutiérrez-Guerrero et al., 2020; Jebb et al., 2020; Parker et al., 2013; Seim et al., 2013; K. Wang et al., 2020; Zepeda Mendoza et al., 2018; G. Zhang et al., 2013). Assemblies were masked with RepeatMasker v.4.1.0 (Smit, Hubley, & Green, n.d.) using a custom library combining known mammalian transposable elements (TE) from Repbase (v20181026), ade novo mammalian TE library generated using assemblies from the Zoonomia Project (Genereux et al., 2020) and the Dfam database, and a custom bat–specific TE library generated by manual curation (Jebb et al., 2020).
All assemblies were annotated or re-annotated with the MAKER annotation pipeline v.2.31.10 (Holt & Yandell, 2011) to avoid bias in downstream analyses caused by differences in genome assembly annotation quality. Two iterations of MAKER were performed for each species. During the first run we provided expressed sequence tags (ESTs) and transcriptomic data as inputs (Davies et al., 2020; Potter et al., n.d.) (Supplementary Table 2). If species-specific transcriptomic data were unavailable, we used information from a related species of the same genus. We used two databases for protein homology the Uniprot/Swiss-Prot protein sequence database (Bateman, 2019) and a bat–specific protein database obtained from high-quality genome annotations for six bat species (Jebb et al., 2020). Repeat evidence was provided using the repeat annotation GFF3 file generated by RepeatMasker. Gene models generated on the first run were used for gene predictions with two gene software packages, SNAP (Korf, 2004) and Augustus (Stanke & Waack, 2003). Only gene models with an AED score < 0.25 and with more than 50 amino acids were retained. For the second run, focusing on re-annotation, the MAKER control file was edited to include the GFF3 output file from the first run gene predictions generated by SNAP and the Augustus gene prediction species model as inputs. Functional annotation was performed with BlastP (Camacho et al., 2009) using the Uniprot/Swiss-Prot database and protein domain annotation with InterProScan (Jones et al., 2014).

Homology inference

Protein homology was inferred among the proteins of 43 mammals:Homo sapiens, Mus musculus, Sus scrofa, Bos taurus, Equus caballus, Canis lupus familiaris, and the 37 bat species (Supplementary Table 1). Orthologous groups (orthogroups) were assigned with Orthofinder v.2.4.0 (Emms & Kelly, 2019). When no orthologs were inferred for the Chiroptera in a given orthogroup, we independently analyzed the genome data to confirm gene losses in bats. To this end, we performed a BLAST search against the 37 bat genomes using the following criteria: an e-value of 1e-6 and an identity and protein coverage greater than 80%. Then, genomic regions with a BLAST hit were extracted along with 200bp upstream and downstream. Sequences were aligned with the MAFFT aligner tool v.7.402 (Katoh & Standley, 2013) and visualized using Geneious version 11.1.3 (Kearse et al., 2012) to discriminate annotation errors. Additionally, BLAST searches were also performed against transcriptomic data from 22 bat species (Supplementary Table 2) (Potter et al., n.d.). For these searches, potential matches were filtered more strictly, and those with identity and protein coverage ≥ 90% were retained. Subsequent blast hit extraction, alignment and visualization were as for the genome searches.

Enrichment in chiropteran gene losses

We conducted pathway enrichment analyses with the final list of genes missing from all bat species using two databases: BioPlanet (R. Huang et al., 2019) and DICE GOnet (Pomaznoy, Ha, & Peters, 2018). In each case, we used the list of gene symbols as input with a cutoff value of 0.05 (BioPlanet) and a similar p–value in the DICE GOnet biological process classification for the mouse model. In both cases, all genes found to be missing were used as input and compared to a reference set of genes annotated in the corresponding database.

Gene family evolution

While previous analyses that included bat species have analyzed signals of positive selection across bats (e.g. Parker et al., 2013), fewer have explicitly centered on gene family evolution (Jebb et al., 2020; Tsagkogeorga et al., 2017). To analyze our comprehensive bat-focused sample, we modeled gene family expansions and contractions using CAFE (Computational Analysis of Gene Family Evolution) v.4.2.1 (M. V. Han, Thomas, Lugo-Martinez, & Hahn, 2013). The effect of genome quality on downstream gene predictions is well documented and leads to an overestimation of gene gains and losses (Denton et al., 2014; Tsagkogeorga et al., 2017). To mitigate the aforementioned bias, only genome assemblies with BUSCO completeness scores over 80%, totaling 34 species (28 bat species and 6 outgroup mammals) were used for CAFE. This smaller subset of protein sequences was filtered, retaining only the longest isoform. Homology clustering was performed with Orthofinder v.2.4.0 (Emms & Kelly, 2019).
CAFE fits a birth and death parameter (λ) to estimate the probability of gene gains or losses across a specified phylogeny (Hahn, De Bie, Stajich, Nguyen, & Cristianini, 2005). We first inferred an ultrametric phylogenomic tree based on 350 single copy orthologous genes (207,551 amino acid sites). All the orthologs were concatenated into a single 207,551–amino acid “contig” and sequence alignment was performed using the MAFFT aligner tool v.7.402 (Katoh & Standley, 2013). We evaluated the best-fit models of protein evolution with ProtTest v.3 (Darriba, Taboada, Doallo, & Posada, 2011) using two criteria: the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (distribution JTT, +G +I +I +G and 80% consensus threshold). A maximum likelihood tree was inferred for the concatenated data set with RAxML v.8 (Stamatakis, 2014). Estimation of species divergence times was performed with Bayesian phylogenetic methods using the MCMCtree tool in the PAML v.4.9 package (Yang, 2007). We calibrated divergence dates using six points based on fossil records: 1)Icaronycteris , considered as one of the oldest echolocating fossil bats, dated at 52 Mya (Gunnell & Simmons, 2005; Simmons, Seymour, Habersetzer, & Gunnell, 2008); Tachypteron, the oldest known emballonurid fossil from the early Middle Eocene, with an age range of 48.6 to 40 Mya (Storch, Sigé, & Habersetzer, 2002);Hipposideros africanum , the oldest fossil record of the family Hipposideridae, its records date at 41.3 Mya (Ravel et al., 2016); Vespertillionidae indet. (41.3 Mya) (Eiting & Gunnell, 2009); Phyllostomidae indet. (30 Mya) (Nicholas J Czaplewski, 2010), andPalynephyllum (11.8 Mya) (Nicolas J Czaplewski, Takai, Naeher, & Setoguchi, 2003; Dávalos, Velazco, Warsi, Smits, & Simmons, 2014). Additionally, we included and corroborated the molecular dates for the base of the ingroup root estimated by Teeling et al. (2005).
We filtered the final input for CAFE to reduce systematic bias in inferring gene family evolution. First, we retained only gene families present at the most recent common ancestor of the phylogeny, with at least one gene present in each of the four clades assigned: a) Euarchontoglires (Homo sapiens and Mus musculus ), b) non-Chiroptera Laurasiatheria (Bos taurus, Canis familiaris, Equus caballus, Sus scrofa ), c) Yangochiroptera and d) Yinpterochiroptera. Second, gene families missing in more than 50% of bat species were excluded. Finally, families with large gene copy number variance (≥100 gene copies) were excluded for the global birth and death (λ) rate inference.
To analyze families with at least one gene copy across the taxa sampled, we first estimated a global λ for all branches. The global model was compared against a three multi-λ model that fits each lineage with its own gene family evolution rate. To test which model fits better with our dataset, we performed a likelihood ratio test for 100 gene family evolution simulations. We ran CAFE in error correction mode to account for genome assembly and annotation errors and estimate the global distribution of error with the assumption that all branches share a unique λ rate (λ=0.0033734) as described in Han et al. (2013). Finally, we used complementary tools; the Protein Analysis Through Evolutionary Relationships (PANTHER v.15) (Mi, Muruganujan, Ebert, Huang, & Thomas, 2019) and Gene Ontology Analysis (GOnet) to annotate genes with gene ontology (GO) terms (Ashburner et al., 2000; Carbon et al., 2019) and assign them to gene families, pathways, and biological process categories.

Selection tests

We identified genes under positive selection by evaluating 268 single–copy genes involved in immune response, based on a curated database of 1,793 genes downloaded from the IMMPORTDB repository (Bhattacharya et al., 2014) available at https://www.immport.org/home. Gene alignments were built with MAFFT v.7.402 (Katoh & Standley, 2013) and manually filtered to remove sequences with less than 70% of protein coverage based on the homologous human protein. Only alignments represented by at least 30% of the species were used for downstream analysis. For each gene in the codeml analyses, we built a phylogeny with RAxML (Stamatakis, 2014) and a codon alignment for each gene with PAL2NAL (Suyama, Torrents, & Bork, 2006).
We tested for evidence of positive selection among sites along bat lineages using the strict branch–site model (Yang, Wong, & Nielsen, 2005; J. Zhang, Nielsen, & Yang, 2005) with maximum-likelihood estimations implemented in codeml in PAML v.4.9 (Yang, 2007). We implemented model 2 as this allows the dN/dS ratio (ω) to vary across branches and sites and to detect if selection differs in a few amino acid residues in specific lineages (foreground branches). We compared two hypotheses, assigning the 37 bat species as foreground branches: 1) the null hypothesis with a fixed ω (ω=1) for all branches does not allow for positive selection, and 2) an alternative hypothesis assuming that the foreground branches have a greater proportion of sites under positive selection (ω > 1) than the background branches. The null hypothesis was tested against the alternative model with the likelihood-ratio test (LRT); the p-value was calculated under a chi-square distribution with 1 degree of freedom, additionally we adjusted the p-value using the false discovery rate (FDR) correction. To detect sites under positive selection, we used the Bayes Empirical Bayes (BEB) (Yang et al., 2005) approach to calculate posterior probabilities that a site has a significant value of ω >1. The residues with a high posterior probability (P > 95%) were considered.
To determine how robust the signals of positive selection detected were, we used the adaptive Branch-Site Random Effects Likelihood (aBSREL) (Smith et al., 2015) model, as implemented in HyPhy (Kosakovsky Pond, Frost, & Muse, 2005). The aBSREL model explores whether a proportion of sites have evolved under positive selection in each branch of the phylogeny, and was applied to all alignments using their respective gene trees. The false discovery rate method of multiple testing correction was applied to all p-values generated for each branch and gene.