DNA sequencing protocols and assembly procedures
All specimens were preserved in 96% ethanol until processed for DNA sequencing. We extracted genomic DNA from whole specimens using the EZ-10 Spin Column Genomic DNA Minipreps kit (BIO BASIC®, Toronto, Canada) and quantified it using the Qubit fluorometer system (High Sensitivity DNA kit, Life Technologies Inc., Carlsbad, CA, USA). We used 1:10 and 1:30 dilutions of DNA template for Sanger sequencing, and followed the procedures described by Aguilar-Velasco et al. (2016) forcox1 amplification and sequencing. Sequences were edited and aligned based on their translated amino acids with the program Geneious version 10.1 (Biomatters, Ltd., Aukland, New Zealand).
We generated genome-wide sequence data using the 3RAD method (Bayona-Vásquez et al. 2019). This technique uses three restriction enzymes, two for the construction of dual-digest libraries and a third that cuts adapter-dimers formed by the phosphorylated adapter, thus increasing the efficiency of adapter ligation (Graham et al. 2015; Bayona-Vásquez et al. 2019). We digested 250 ng of the extracted genomic DNA for each sample using the XbaI and EcoRI-HF restriction enzymes (New England Biolabs; Beverly, MA, USA), which leave different sticky ends, and NheI (New England Biolabs; Beverly, MA, USA) to digest iTru adapter dimers. We ligated double-stranded iTru R1 and iTru R2.1 adapters onto each DNA fragment and ran a short PCR (13 -15 cycles) with the iTru5 and iTru7 primers obtained from Adapterama (Bayona-Vásquez et al. 2019). The resulting libraries were size selected in a 200-800 bp window and sequenced at the Genomic Sequencing Lab facilities at the University of California Berkeley. Libraries were sequenced using the 150 SRR HiSeq2500 Rapid, 10 pM, INDEX (124M Reads, 72% PhiX Aligned).
We used the process_radtags program implemented in the software pipeline Stacks version 2.0 (Catchen et al. 2011; Catchen et al. 2013) to demultiplex, clean, and trim the sequence data. We discarded any read with an uncalled base (-c) or with low quality scores (-q). We processed demultiplexed reads using the software pipeline ipyrad version 0.6.19 (Eaton 2014; Eaton and Overcast 2020) on the Miztli supercomputer owned by the Dirección General de Cómputo y de Tecnologías de Información y Comunicación, National Autonomous University of Mexico (DGTIC, UNAM). Reads from each sample were clustered using the program VSEARCH version 2.0.3 (https://github.com/torognes/vsearch) and aligned with the program MUSCLE version 3.8.31 (Edgar 2004).
To avoid the potential for false heterozygous calls due to clustering of paralogs (optimum clustering threshold; Eaton 2014), we followed the approach described by Ilut et al. (2014) to assess the level of sequence similarity at which two fragments are considered homologous. This approach minimizes the number of false homozygous (a single locus split into two) and false heterozygous (clustering of paralogs) loci in a clustering threshold series. We analyzed a clustering threshold series ranging from 0.80 to 0.98 in 0.02-0.03 increments. We also conducted maximum likelihood (ML) phylogenetic analyses as described below for the matrices derived from the above clustering threshold values to evaluate their level of nodal support. Based on the results obtained from these two approaches, we selected a clustering threshold value of 0.98 to build four matrices with min_samples_locus values of 25, 28, 30 and 33 including the outgroup and one with a min_sample_locus of 33 excluding it.
We generated UCE data from libraries following Branstetter et al. (2017). For this data set, we included previously generated UCE data from four male specimens (Meza-Lázaro et al. 2018). We fragmented up to 50 ng of input DNA to an average fragment distribution of 400-600 bp using a Qsonica Q800R (Qsonica LLC, Newton, CT) or a BioRuptor® Pico sonicator (Diagenode, Liége, Belgium). Following DNA fragmentation, we constructed sequencing libraries using the Kapa library preparation kit (Kapa Biosystems Inc., Wilmington, MA) and custom dual-indexing barcodes (Glenn et al. 2019). We purified PCR reactions 0.8 to 1.0X using Sera-Mag™ SpeedBeads (Thermo-Scientific, Waltham, MA, USA) (Rohland and Reich, 2012).
We pooled 10-12 libraries at equimolar concentrations for UCE enrichment, adjusting pool concentrations to 147 ng/μl. We used a total of 500 ng of DNA (3.4 μl each pool) for each enrichment. We enriched each pool using the bait set ‘ant-specific hym-v2’ (Branstetter et al. 2017), which has 9,446 custom-designed probes (MyBaits, MYcroarray, Inc., Ann Arbor, MI, USA) targeting 2,524 UCE loci and 452 baits targeting 16 commonly sequenced exons. The enriched library quality was verified using an Agilent TapeStation 2200 (Agilent Tech, Santa Clara, CA, USA). We sent pools to the University of Utah Genomics Core facility and to the Georgia Genomics Facility at the University of Georgia, where they were sequenced on an Illumina HiSeq 2500 (PE150) and an Illumina NextSeq v2300 (PE150), respectively.
Raw data were demultiplexed and converted from BCl to FASTQ by the sequencing facilities. We used the program PHYLUCE version 1.5.0 and its associated programs (Faircloth 2016) for assembly and alignment of the UCE data. We cleaned and trimmed raw reads using ILLUMIPROCESSOR (Faircloth 2013). The cleaned and trimmed reads were assembled de novo using the program ABySS version 1.3.6 (Simpson et al. 2009). We mapped the assembled contigs to the hym-v2 bait database to identify individual UCE loci, to remove paralogs and to generate a list of shared UCE loci. We sorted out data by locus and aligned each one with the program MAFFT version 7.130b (Katoh et al. 2019). The resulting alignments were filtered and trimmed with the program Gblocks version 0.91b (Castresana 2000; Talavera and Castresana 2007). We analyzed matrices with 75, 80, 90, 95 and 100% taxon occupancy (percent of taxa required to be present at each locus).
We also followed the Tutorial II: Phasing UCE data (Faircloth 2021) to call for SNPs, which is derived from the procedure described by Andermann et al. (2019). The above workflow requires an individual-specific “reference” that can be aligned against raw reads. Hymenopteran males are haploids, and thus we only expected homozygous loci for them. We used edge-trimmed exploded alignments as reference contigs and aligned raw reads to them. We exploded the edge trimmed alignments to create separate FASTA files for each sample using phyluce_align_explode_alignments. We used BWA-MEM to map the FASTQ read files against the contig reference database for each sample. We sorted the reads within each bam file into two separate bam files using phyluce_snp_phase_uces. We built three final matrices based on filtering UCE loci with 85%, 90% and 100% taxon occupancy. We also used the exploded alignment and raw reads of the 100% taxon occupancy matrix to build an additional matrix phasing the data and calling a single variant SNP per locus.