Bait set design and synthesis
All downloaded transcriptomes were assembled de novo using the
pipeline from Cunha & Giribet (2019). Briefly, quality threshold
filtering was conducted with Rcorrector v. 3.0 (Song & Florea, 2015)
and Trim Galore! v. 3
(http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/).
Unwanted mollusc rRNA and mitochondrial sequences were filtered out
using Bowtie2 v. 2.3.2 (Langmead & Salzberg, 2012). Paired-end reads
were de novo assembled into transcripts with Trinity v. 2.4
(Grabherr et al., 2010; Haas et al., 2013). A second Bowtie2 round and
CD-HIT-EST v. 4.6.4 were used to reduce sequence redundancy (Fu, Niu,
Zhu, Wu, & Li, 2012). The software PHYLUCE (Faircloth, 2016) was then
used to identify UCE loci and design baits to target them, following the
online tutorial
(https://phyluce.readthedocs.io/en/latest/tutorial-four.html).
Downloaded FASTA files for the selected genomes/transcriptomes were
reformatted into 2bit using faToTwoBit, and headers were modified using
Bio.SeqIO (Grüning et al., 2018) for compatibility with PHYLUCE. ART
(Huang, Li, Myers, & Marth, 2012) was used to simulate reads of
100 bp in length, covering the genome randomly to roughly 2X and having
an insert size of 200 bp (150 SD), for each species. These were
individually mapped to putatively orthologous loci, with a sequence
divergence of < 5% from our base genome (A. californica), using stampy
v. 1 (Lunter & Goodson, 2011), and unmapped
reads were removed with SAMtools v. 1.5 (Li et al., 2009). BEDTools
(Quinlan & Hall, 2010) was used to convert BAM files to BED format, sort
the contigs by scaffold and position, and merge them into putative
conserved regions.
Intervals shorter than 80 bp and those in which > 25% of the base
genome was masked (i.e., repetitive regions) were removed in PHYLUCE.
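The interval filter described above can be illustrated with a minimal sketch. This is not the PHYLUCE implementation: the function names are hypothetical, and the sketch assumes that masking is encoded as lowercase (soft-masked) bases or N, which is a common convention but is not stated in the text.

```python
def masked_fraction(seq: str) -> float:
    """Fraction of bases that are soft-masked (lowercase) or N.

    Assumption: repeats are marked as lowercase letters or N.
    """
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base in "Nn")
    return masked / len(seq)


def keep_interval(seq: str, min_length: int = 80, max_masked: float = 0.25) -> bool:
    """Keep an interval that is at least min_length bp long and in which
    no more than max_masked of the bases are masked."""
    return len(seq) >= min_length and masked_fraction(seq) <= max_masked
```

With these defaults, a 100 bp unmasked interval is kept, whereas a 79 bp interval or a 100 bp interval with 30 masked bases is discarded.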
An SQLite table was created to query for conserved loci across taxa,
with the optimal threshold of four out of five taxa yielding a total of
7,222 shared loci. Temporary baits were designed to capture loci shared
among the base genome and the exemplar taxa. Loci were buffered to
160 bp so that two 120mers could be designed per locus at 3x tiling
density, and potentially problematic baits with >25% masking or GC
content outside a 30–70% range were removed. Finally, potential
duplicates with >50% identity and coverage were identified and removed. In
order to include baits designed from the base genome and the exemplar
taxa, the temporary baits were also aligned against all five exemplar
taxa and conserved loci were extracted as FASTA files. An additional
SQLite table was created to check for the loci found consistently across
taxa.
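The tiling arithmetic and bait filters described above can be sketched as follows. This is an illustration, not the PHYLUCE code. It assumes that a 3x tiling density corresponds to a step of 120 / 3 = 40 bp between bait starts, which is consistent with two 120mers spanning a 160 bp locus, and that masked bases are lowercase or N. All function names are hypothetical.

```python
from typing import List


def tile_baits(locus: str, bait_length: int = 120, tiling_density: int = 3) -> List[str]:
    """Tile a locus with baits whose starts are bait_length / tiling_density apart.

    Assumption: the step between baits is bait_length // tiling_density (40 bp here).
    """
    step = bait_length // tiling_density
    last_start = len(locus) - bait_length
    return [locus[s:s + bait_length] for s in range(0, last_start + 1, step)]


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases, ignoring case."""
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper) if upper else 0.0


def passes_filters(bait: str, max_masked: float = 0.25,
                   gc_min: float = 0.30, gc_max: float = 0.70) -> bool:
    """Reject baits with >25% masked bases or GC content outside 30-70%."""
    if not bait:
        return False
    masked = sum(1 for b in bait if b.islower() or b in "Nn") / len(bait)
    return masked <= max_masked and gc_min <= gc_fraction(bait) <= gc_max
```

Under these assumptions, a 160 bp locus yields exactly two 120mers, starting at positions 0 and 40.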
We ultimately targeted the 2,320 loci that were shared among five out of
the six taxa. Final bait design followed the steps described above but
used both the base genome and the rest of the exemplar taxa. A subset of
the bait set targeting only the heterobranch species (excluding the
caenogastropod P. canaliculata) was designed using
phyluce_probe_get_subsets_of_tiled_probes. The final set contained
19,333 baits and targeted 2,259 loci across Tectipleura. In silico tests
of the UCE bait set were performed against de novo assembled
transcriptomes from a wide range of taxa belonging to Heterobranchia and
two Caenogastropoda outgroups using
phyluce_assembly_match_contigs_to_probes (see Table 1).
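The locus-sharing threshold used above (loci present in a minimum number of taxa) can be illustrated with a small SQLite query. This sketch is not the table that PHYLUCE builds: the long-format schema (one row per locus and taxon), the function name, and the locus and taxon labels in the usage note are all assumptions made for illustration.

```python
import sqlite3


def shared_loci(matches, min_taxa):
    """Return loci detected in at least min_taxa distinct taxa.

    matches: iterable of (locus, taxon) pairs (hypothetical long format).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE matches (locus TEXT, taxon TEXT)")
    conn.executemany("INSERT INTO matches VALUES (?, ?)", list(matches))
    rows = conn.execute(
        "SELECT locus FROM matches "
        "GROUP BY locus "
        "HAVING COUNT(DISTINCT taxon) >= ? "
        "ORDER BY locus",
        (min_taxa,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```

For example, with a locus "uce-1" found in five taxa and a locus "uce-2" found in three, calling shared_loci(matches, 5) returns only "uce-1".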
To synthesize the designed bait set, each bait candidate was
BLASTed against the base genome to filter out non-specific or
over-represented regions and a hybridization melting temperature
(defined as the temperature at which 50% of molecules are hybridized)
was estimated for each hit assuming standard myBaits® (Arbor
Biosciences, MI, USA) buffers and conditions. In total, 129 baits
matched a portion of the genome that was >25% soft-masked
for repeats, and 5 baits failed the Moderate BLAST analysis (candidates
pass if they have at most 10 hits at 62.5–65 °C and 2 hits above 65 °C,
and fewer than 2 passing baits on each flank), indicating that they had
multiple hits to the genome at the hybridization temperature; all of
these baits were removed. Due to technical difficulties in synthesizing the 120mer
set, three overlapping 70mers were designed for each 120mer (1 bait
every 25 nt), providing the same coverage of the original design
targets and the same capture efficiency as the 120mer design.
Interestingly, shorter fragments can be used at a range of
hybridization temperatures and are thus more effective on degraded
museum samples. The total design had 57,606 baits (derived from the
original set of 19,202 120mers).
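The re-tiling arithmetic can be checked with a short sketch. It assumes that the three 70mers start at offsets 0, 25 and 50 within each 120mer, which follows from "1 bait every 25 nt" but is not spelled out in the text. The function name is hypothetical.

```python
def split_bait(bait: str, sub_length: int = 70, step: int = 25) -> list:
    """Split one bait into overlapping sub-baits of sub_length, one every step nt."""
    last_start = len(bait) - sub_length
    return [bait[s:s + sub_length] for s in range(0, last_start + 1, step)]


# A 120mer gives starts 0, 25 and 50, so the last 70mer ends at position 120
# and the full original bait remains covered.
example_120mer = "ACGT" * 30
sub_baits = split_bait(example_120mer)

# Three 70mers per 120mer reproduces the reported total: 19,202 * 3 = 57,606.
total_70mers = 19202 * len(sub_baits)
```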