DNA sequencing protocols and assembly procedures
All specimens were preserved in 96% ethanol until processed for DNA
sequencing. We extracted genomic DNA from whole specimens using the
EZ-10 Spin Column Genomic DNA Minipreps kit (BIO BASIC®, Toronto,
Canada) and quantified it using the Qubit fluorometer system (High
Sensitivity DNA kit, Life Technologies Inc., Carlsbad, CA, USA). We used
1:10 and 1:30 dilutions of DNA template for Sanger sequencing, and
followed the procedures described by Aguilar-Velasco et al. (2016) forcox1 amplification and sequencing. Sequences were edited and
aligned based on their translated amino acids with the program Geneious
version 10.1 (Biomatters, Ltd., Aukland, New Zealand).
We generated genome-wide sequence data using the 3RAD method
(Bayona-Vásquez et al. 2019). This technique uses three restriction
enzymes, two for the construction of dual-digest libraries and a third
that cuts adapter-dimers formed by the phosphorylated adapter, thus
increasing the efficiency of adapter ligation (Graham et al. 2015;
Bayona-Vásquez et al. 2019). We digested 250 ng of the extracted genomic
DNA for each sample using the XbaI and EcoRI-HF restriction enzymes (New
England Biolabs; Beverly, MA, USA), which leave different sticky ends,
and NheI (New England Biolabs; Beverly, MA, USA) to digest iTru adapter
dimers. We ligated double-stranded iTru R1 and iTru R2.1 adapters onto
each DNA fragment and ran a short PCR (13 -15 cycles) with the iTru5 and
iTru7 primers obtained from Adapterama (Bayona-Vásquez et al. 2019). The
resulting libraries were size selected in a 200-800 bp window and
sequenced at the Genomic Sequencing Lab facilities at the University of
California Berkeley. Libraries were sequenced using the 150 SRR
HiSeq2500 Rapid, 10 pM, INDEX (124M Reads, 72% PhiX Aligned).
We used the process_radtags program implemented in the software
pipeline Stacks version 2.0 (Catchen et al. 2011; Catchen et al. 2013)
to demultiplex, clean, and trim the sequence data. We discarded any read
with an uncalled base (-c) or with low quality scores (-q). We processed
demultiplexed reads using the software pipeline ipyrad version 0.6.19
(Eaton 2014; Eaton and Overcast 2020) on the Miztli supercomputer owned
by the Dirección General de Cómputo y de Tecnologías de Información y
Comunicación, National Autonomous University of Mexico (DGTIC, UNAM).
Reads from each sample were clustered using the program VSEARCH version
2.0.3 (https://github.com/torognes/vsearch) and aligned with the program
MUSCLE version 3.8.31 (Edgar 2004).
To avoid the potential for false heterozygous calls due to clustering of
paralogs (optimum clustering threshold; Eaton 2014), we followed the
approach described by Ilut et al. (2014) to assess the level of sequence
similarity at which two fragments are considered homologous. This
approach minimizes the number of false homozygous (a single locus split
into two) and false heterozygous (clustering of paralogs) loci in a
clustering threshold series. We analyzed a clustering threshold series
ranging from 0.80 to 0.98 in 0.02-0.03 increments. We also conducted
maximum likelihood (ML) phylogenetic analyses as described below for the
matrices derived from the above clustering threshold values to evaluate
their level of nodal support. Based on the results obtained from these
two approaches, we selected a clustering threshold value of 0.98 to
build four matrices with min_samples_locus values of 25, 28, 30 and 33
including the outgroup and one with a min_sample_locus of 33 excluding
it.
We generated UCE data from libraries following Branstetter et al.
(2017). For this data set, we included previously generated UCE data
from four male specimens (Meza-Lázaro et al. 2018). We fragmented
up to 50 ng of input DNA to an average fragment distribution of 400-600
bp using a Qsonica Q800R (Qsonica LLC, Newton, CT) or a BioRuptor® Pico
sonicator (Diagenode, Liége, Belgium). Following DNA fragmentation, we
constructed sequencing libraries using the Kapa library preparation kit
(Kapa Biosystems Inc., Wilmington, MA) and custom dual-indexing barcodes
(Glenn et al. 2019). We purified PCR reactions 0.8 to 1.0X using
Sera-Mag™ SpeedBeads (Thermo-Scientific, Waltham, MA, USA) (Rohland and
Reich, 2012).
We pooled 10-12 libraries at equimolar concentrations for UCE
enrichment, adjusting pool concentrations to 147 ng/μl. We used a total
of 500 ng of DNA (3.4 μl each pool) for each enrichment. We enriched
each pool using the bait set ‘ant-specific hym-v2’ (Branstetter et al.
2017), which has 9,446 custom-designed probes (MyBaits, MYcroarray,
Inc., Ann Arbor, MI, USA) targeting 2,524 UCE loci and 452 baits
targeting 16 commonly sequenced exons. The enriched library quality was
verified using an Agilent TapeStation 2200 (Agilent Tech, Santa Clara,
CA, USA). We sent pools to the University of Utah Genomics Core facility
and to the Georgia Genomics Facility at the University of Georgia, where
they were sequenced on an Illumina HiSeq 2500 (PE150) and an Illumina
NextSeq v2300 (PE150), respectively.
Raw data were demultiplexed and converted from BCl to FASTQ by the
sequencing facilities. We used the program PHYLUCE version 1.5.0 and its
associated programs (Faircloth 2016) for assembly and alignment of the
UCE data. We cleaned and trimmed raw reads using ILLUMIPROCESSOR
(Faircloth 2013). The cleaned and trimmed reads were assembled de novo
using the program ABySS version 1.3.6 (Simpson et al. 2009). We mapped
the assembled contigs to the hym-v2 bait database to identify individual
UCE loci, to remove paralogs and to generate a list of shared UCE loci.
We sorted out data by locus and aligned each one with the program MAFFT
version 7.130b (Katoh et al. 2019). The resulting alignments were
filtered and trimmed with the program Gblocks version 0.91b (Castresana
2000; Talavera and Castresana 2007). We analyzed matrices with 75, 80,
90, 95 and 100% taxon occupancy (percent of taxa required to be present
at each locus).
We also followed the Tutorial II: Phasing UCE data (Faircloth 2021) to
call for SNPs, which is derived from the procedure described by
Andermann et al. (2019). The above workflow requires an
individual-specific “reference” that can be aligned against raw reads.
Hymenopteran males are haploids, and thus we only expected homozygous
loci for them. We used edge-trimmed exploded alignments as reference
contigs and aligned raw reads to them. We exploded the edge trimmed
alignments to create separate FASTA files for each sample using
phyluce_align_explode_alignments. We used BWA-MEM to map the FASTQ
read files against the contig reference database for each sample. We
sorted the reads within each bam file into two separate bam files using
phyluce_snp_phase_uces. We built three final matrices based on
filtering UCE loci with 85%, 90% and 100% taxon occupancy. We also
used the exploded alignment and raw reads of the 100% taxon occupancy
matrix to build an additional matrix phasing the data and calling a
single variant SNP per locus.