Sequence capture data analysis
Sequence capture reads were sorted into each species by the 8-bp species
index. The raw reads of each species were filtered to remove adapter
sequences and low-quality nucleotides. To accelerate assembly and save
computing resources, we down-sampled reads over high-depth areas at an
average depth of 20× by normalization using BBNORM.SH (BBTools, Bushnell
2014). The normalized read data were then de novo assembled using
the SPAdes version 3.8.1 genome assembler (Bankevich et al., 2012),
using an auto K-mer mode (–cov-cutoff auto). Only contigs longer than
200 bp were retained. The retained contigs were further filtered with
CD-HIT-EST to reduce redundancy (95% similarity
cutoff). Finally, the contigs were
BLASTed against the reference ORF and UTR sets to remove non-target
sequences, thus reduce the computational intensity of the subsequent
analyses.
Extract ORF (coding)
sequences —Each of the reference
ORF sequences is typically composed of multiple exons. In the genome,
these exons are interrupted by introns. To extract the coding sequence
corresponding to the entire reference ORF, we need to identify correct
exons from assembled contigs and stitch these identified exons into a
complete ORF sequence. We adopted a bioinformatics pipeline called ”exon
mapping” to fulfill this purpose. For each reference ORF,
we used EXONERATE version 2.2.0
(Slater & Birney 2005) to locate its relevant exons from the filtered
contigs based on its translated protein sequence. The identified exons
are then mapped onto the reference ORF to make an orthologous ORF
sequence (missing regions were filled with N), based on the coordinate
information from the EXONERATE searches. This exon mapping strategy
makes the obtained coding sequences from each sample have the same
length to the reference ORF, which reduces the difficulty of sequence
aligning. The workflow of extracting the orthologous ORF sequence is
illustrated in Figure 2b. The whole bioinformatics procedure is
fulfilled by an in-house Python script ”Extracting coding sequences.py.”
Extract
UTR (noncoding) sequences — Unlike the ORF sequences are fractured in
the genome, UTR sequences are commonly continuous in the genome and most
likely located within single assembled contigs. Based on the reference
UTRs, we used a mutual best-hit (MBH) strategy (program = BLASTN,
e-value < 1e−10, identity > 70%) to extract UTR
orthologous groups (OGs) from all samples. The 1:1 orthology is
confirmed if one contig of a sampled species and one reference UTR
sequence find each other as the best hit in the bidirectional BLAST.
Within each UTR OGs, the extracted sequences are normally different in
length, which may make them difficult to be aligned. We used a
previously published Python script (Li et al., 2019) to determine the
optimal aligning region for all sequences within a UTR OG. In brief, the
script uses the mutual-BLAST results to determine the relative position
of each sequence to the reference sequence and searches for the optimal
upstream and downstream boundaries to trim the sequences. Here, we
demand that at least 40% of species have data at both the upstream and
downstream boundaries. The workflow of extracting the orthologous UTR
sequence is illustrated in Figure 2c.