Bait set design and synthesis
All downloaded transcriptomes were assembled de novo following the pipeline of Cunha & Giribet (2019). Briefly, reads were error-corrected and quality-filtered with Rcorrector v. 3.0 (Song & Florea, 2015) and Trim Galore! v. 3 (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). Unwanted rRNA and mitochondrial sequences from molluscs were filtered out using Bowtie2 v. 2.3.2 (Langmead & Salzberg, 2012). Paired-end reads were assembled de novo into transcripts with Trinity v. 2.4 (Grabherr et al., 2010; Haas et al., 2013). A second Bowtie2 round and CD-HIT-EST v. 4.6.4 were used to reduce sequence redundancy (Fu, Niu, Zhu, Wu, & Li, 2012). The software PHYLUCE (Faircloth, 2016) was then used to identify UCE loci and design baits targeting them, following the online tutorial (https://phyluce.readthedocs.io/en/latest/tutorial-four.html). Downloaded FASTA files for the selected genomes/transcriptomes were converted to 2bit format using faToTwoBit, and headers were modified using Bio.SeqIO (Grüning et al., 2018) for compatibility with PHYLUCE. ART (Huang, Li, Myers, & Marth, 2012) was used to simulate reads of 100 bp in length for each species, covering the genome randomly to roughly 2X with an insert size of 200 bp (150 SD). These reads were mapped, species by species, to our base genome (A. californica) using stampy v. 1 (Lunter & Goodson, 2011), allowing a sequence divergence of < 5% to identify putatively orthologous loci, and unmapped reads were removed with SAMtools v. 1.5 (Li et al., 2009). BEDTools (Quinlan & Hall, 2010) was used to convert the BAM files to BED format, sort the contigs by scaffold and position, and merge them into putative conserved regions. Intervals shorter than 80 bp in the base genome, or in which > 25% of the base genome was masked (i.e., repetitive regions), were removed in PHYLUCE.
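The interval filter described at the end of this paragraph (discarding regions shorter than 80 bp or more than 25% masked) can be illustrated with a minimal sketch. This is not the PHYLUCE implementation: the function names are hypothetical, and the sketch assumes that masked bases appear as lowercase letters (soft masking) or as N in the base-genome sequence.

```python
def masked_fraction(seq: str) -> float:
    """Return the fraction of bases that are soft-masked (lowercase) or N.

    Assumption: masking is encoded as lowercase letters or N, as in
    soft-masked genome FASTA files.
    """
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base == "N")
    return masked / len(seq)


def keep_interval(seq: str, min_len: int = 80, max_masked: float = 0.25) -> bool:
    """Keep an interval only if it is at least `min_len` bp long and
    no more than `max_masked` of its bases are masked."""
    return len(seq) >= min_len and masked_fraction(seq) <= max_masked


# Hypothetical intervals extracted from the base genome.
intervals = {
    "too_short": "ACGT" * 19 + "ACG",      # 79 bp, unmasked
    "clean": "ACGT" * 20,                  # 80 bp, unmasked
    "repetitive": "ACGT" * 14 + "a" * 24,  # 80 bp, 30% soft-masked
}
kept = [name for name, seq in intervals.items() if keep_interval(seq)]
```

The thresholds default to the values given in the text; an interval at exactly 25% masking is retained because only intervals with more than 25% masking are removed.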
An SQLite table was created to query for conserved loci shared across taxa; the optimal threshold of four out of five taxa yielded a total of 7,222 shared loci. Temporary baits were designed to capture loci shared between the base genome and the exemplar taxa. Loci were buffered to 160 bp so that two 120mers could be designed per locus at 3x tiling density, and potentially problematic baits with >25% masking or GC content outside a 30–70% range were removed. Finally, potential duplicates with >50% identity and coverage were identified and removed. To include baits designed from both the base genome and the exemplar taxa, the temporary baits were also aligned against all five exemplar taxa and the conserved loci were extracted as FASTA files. An additional SQLite table was created to check which loci were found consistently across taxa.
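The kind of query used to select loci shared by a minimum number of taxa can be sketched as follows. This is an illustration only: the table layout (one row per locus, one 0/1 presence column per taxon), the table name `hits`, the taxon labels, and the function name are assumptions made for this example and are not taken from the PHYLUCE database schema.

```python
import sqlite3


def shared_loci(rows, taxa, min_taxa):
    """Return the loci present in at least `min_taxa` of the given taxa.

    rows: iterable of tuples (locus, presence_1, ..., presence_n), with each
          presence value 0 or 1, in the same order as `taxa`.
    The schema used here is hypothetical and built in memory.
    """
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f'"{taxon}" INTEGER' for taxon in taxa)
    conn.execute(f"CREATE TABLE hits (locus TEXT PRIMARY KEY, {columns})")
    placeholders = ", ".join("?" for _ in range(len(taxa) + 1))
    conn.executemany(f"INSERT INTO hits VALUES ({placeholders})", rows)
    # Summing the 0/1 presence columns gives the number of taxa per locus.
    total = " + ".join(f'"{taxon}"' for taxon in taxa)
    cursor = conn.execute(
        f"SELECT locus FROM hits WHERE {total} >= ? ORDER BY locus",
        (min_taxa,),
    )
    loci = [row[0] for row in cursor]
    conn.close()
    return loci


# Hypothetical presence/absence data for five placeholder taxa.
taxa = ["taxon_a", "taxon_b", "taxon_c", "taxon_d", "taxon_e"]
rows = [
    ("locus_1", 1, 1, 1, 1, 1),
    ("locus_2", 1, 1, 1, 1, 0),
    ("locus_3", 1, 1, 0, 0, 1),
]
selected = shared_loci(rows, taxa, min_taxa=4)
```

Raising `min_taxa` makes the selection stricter and returns fewer loci, which is the trade-off behind choosing a threshold such as four out of five taxa.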
We ultimately targeted loci shared among five out of the six taxa, totalling 2,320. Final bait design followed the steps described above but used both the base genome and the rest of the exemplar taxa. A subset of the bait set targeting only the heterobranch species (excluding the caenogastropod P. canaliculata) was extracted using phyluce_probe_get_subsets_of_tiled_probes. The final set contained 19,333 baits and targeted 2,259 loci across Tectipleura. In silico tests of the UCE bait set were performed against de novo assembled transcriptomes from a wide range of taxa belonging to Heterobranchia and two Caenogastropoda outgroups using phyluce_assembly_match_contigs_to_probes (see Table 1).
To prepare the designed bait set for synthesis, each bait candidate was BLASTed against the base genome to filter out non-specific or over-represented regions, and a hybridization melting temperature (defined as the temperature at which 50% of molecules are hybridized) was estimated for each hit assuming standard myBaits® (Arbor Biosciences, MI, USA) buffers and conditions. A total of 129 baits matched a portion of the genome that was >25% soft-masked for repeats, and 5 baits failed our Moderate BLAST analysis (candidates pass if they have at most 10 hits at 62.5–65 °C and 2 hits above 65 °C, and fewer than 2 passing baits on each flank), indicating that they had multiple hits to the genome at the hybridization temperature; these baits were removed. Due to technical difficulties in synthesizing the 120mer set, three overlapping 70mers were designed for each 120mer (one bait every 25 nt), providing the same coverage of the original design targets and the same capture efficiency as the 120mer design. Shorter baits can also be used across a range of hybridization temperatures, which makes them more effective on degraded museum samples. The final design comprised 57,606 baits (derived from the original set of 19,202 120mers).
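The re-tiling of each 120mer into three overlapping 70mers can be checked with a short sketch. The function name and the example sequence are hypothetical; the sketch assumes that tiles start at the first base of the 120mer and advance 25 nt at a time, which places the three tiles at positions 0–70, 25–95 and 50–120 and so covers the full 120 nt.

```python
def tile_bait(bait: str, length: int = 70, step: int = 25) -> list[str]:
    """Split a bait into overlapping tiles of `length` nt, one every `step` nt.

    Assumption: tiling starts at position 0 and only full-length tiles are kept.
    """
    starts = range(0, len(bait) - length + 1, step)
    return [bait[start:start + length] for start in starts]


# Hypothetical 120 nt bait sequence used only for illustration.
bait_120 = "ACGT" * 30
tiles = tile_bait(bait_120)

n_tiles = len(tiles)                       # three 70mers per 120mer
covers_full_bait = tiles[0] == bait_120[:70] and tiles[-1] == bait_120[50:]
total_baits = 19_202 * n_tiles             # 120mers in the text times tiles per bait
```

With three tiles per 120mer, the 19,202 120mers reported in the text correspond to 57,606 70mers, matching the size of the final design.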