Sequence capture data analysis
Sequence capture reads were sorted into each species by the 8-bp species index. The raw reads of each species were filtered to remove adapter sequences and low-quality nucleotides. To accelerate assembly and save computing resources, we down-sampled reads over high-depth areas at an average depth of 20× by normalization using BBNORM.SH (BBTools, Bushnell 2014). The normalized read data were then de novo assembled using the SPAdes version 3.8.1 genome assembler (Bankevich et al., 2012), using an auto K-mer mode (–cov-cutoff auto). Only contigs longer than 200 bp were retained. The retained contigs were further filtered with CD-HIT-EST to reduce redundancy (95% similarity cutoff). Finally, the contigs were BLASTed against the reference ORF and UTR sets to remove non-target sequences, thus reduce the computational intensity of the subsequent analyses.
Extract ORF (coding) sequences —Each of the reference ORF sequences is typically composed of multiple exons. In the genome, these exons are interrupted by introns. To extract the coding sequence corresponding to the entire reference ORF, we need to identify correct exons from assembled contigs and stitch these identified exons into a complete ORF sequence. We adopted a bioinformatics pipeline called ”exon mapping” to fulfill this purpose. For each reference ORF, we used EXONERATE version 2.2.0 (Slater & Birney 2005) to locate its relevant exons from the filtered contigs based on its translated protein sequence. The identified exons are then mapped onto the reference ORF to make an orthologous ORF sequence (missing regions were filled with N), based on the coordinate information from the EXONERATE searches. This exon mapping strategy makes the obtained coding sequences from each sample have the same length to the reference ORF, which reduces the difficulty of sequence aligning. The workflow of extracting the orthologous ORF sequence is illustrated in Figure 2b. The whole bioinformatics procedure is fulfilled by an in-house Python script ”Extracting coding sequences.py.”
Extract UTR (noncoding) sequences — Unlike the ORF sequences are fractured in the genome, UTR sequences are commonly continuous in the genome and most likely located within single assembled contigs. Based on the reference UTRs, we used a mutual best-hit (MBH) strategy (program = BLASTN, e-value < 1e−10, identity > 70%) to extract UTR orthologous groups (OGs) from all samples. The 1:1 orthology is confirmed if one contig of a sampled species and one reference UTR sequence find each other as the best hit in the bidirectional BLAST. Within each UTR OGs, the extracted sequences are normally different in length, which may make them difficult to be aligned. We used a previously published Python script (Li et al., 2019) to determine the optimal aligning region for all sequences within a UTR OG. In brief, the script uses the mutual-BLAST results to determine the relative position of each sequence to the reference sequence and searches for the optimal upstream and downstream boundaries to trim the sequences. Here, we demand that at least 40% of species have data at both the upstream and downstream boundaries. The workflow of extracting the orthologous UTR sequence is illustrated in Figure 2c.