Building up reference ORF and UTR sets
The cDNA probes needed to be sequenced to provide reference sequence sets for the subsequent analysis of captured data. To this end, 100 ng of the cDNA probes were used to construct a sequencing library following the same procedure as for genomic library preparation. The probe library was sequenced on an Illumina HiSeq X Ten sequencer in paired-end 150-bp mode. The raw reads were first filtered to remove adapter sequences and low-quality nucleotides using Trimmomatic version 0.36 (Bolger et al., 2014), and read quality was checked with FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Clean reads were assembled into transcripts using TRINITY r20140717 with default parameters (Grabherr et al., 2011). The resulting transcripts were clustered with CD-HIT-EST version 4.6.5 (Fu et al., 2012) at a 95% similarity cutoff to reduce redundancy. The sequencing depth of each filtered transcript was calculated with SAMtools version 1.4.1 (Li et al., 2009). Only transcripts with an average sequencing depth ≥ 5× and a length ≥ 200 bp were retained. TransDecoder, a program in the TRINITY package, was used to determine the open reading frame (ORF) of each transcript. Based on the position of the ORF, each transcript was divided into a 5’ untranslated region (UTR), a coding region, and a 3’ UTR. The translated protein sequences of the predicted ORFs were searched with BLASTP (NCBI BLAST+ version 2.6.0, Boratyn et al., 2013) against the human proteome with an e-value threshold of 1E-10. To focus on known vertebrate transcripts, only transcripts with BLASTP hits were retained. Finally, all ORFs of length > 300 bp and UTRs of length > 100 bp were extracted using a custom Python script to build two reference sets (ORF and UTR) for the subsequent analysis of captured data (Fig. 2a).
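The final extraction step can be illustrated with a minimal Python sketch. This is not the custom script used in this study; it only shows one way the length thresholds stated above (ORF > 300 bp, UTR > 100 bp) could be applied. The function names, the 0-based half-open ORF coordinates, the forward-strand ORF orientation, and the choice to filter ORFs and UTRs independently of each other are all assumptions made for illustration.

```python
from typing import Dict, NamedTuple, Tuple

MIN_ORF_LEN = 300  # keep ORFs strictly longer than 300 bp (threshold from the text)
MIN_UTR_LEN = 100  # keep UTRs strictly longer than 100 bp (threshold from the text)


class Regions(NamedTuple):
    utr5: str
    orf: str
    utr3: str


def split_transcript(seq: str, orf_start: int, orf_end: int) -> Regions:
    """Split a transcript into 5' UTR, ORF and 3' UTR.

    Assumes (hypothetically) 0-based, half-open coordinates
    [orf_start, orf_end) and an ORF on the forward strand.
    """
    if not 0 <= orf_start < orf_end <= len(seq):
        raise ValueError("ORF coordinates fall outside the transcript")
    return Regions(seq[:orf_start], seq[orf_start:orf_end], seq[orf_end:])


def build_reference_sets(
    transcripts: Dict[str, str],
    orf_coords: Dict[str, Tuple[int, int]],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (orf_set, utr_set) after applying the length thresholds."""
    orf_set: Dict[str, str] = {}
    utr_set: Dict[str, str] = {}
    for name, seq in transcripts.items():
        if name not in orf_coords:
            continue  # no predicted ORF for this transcript
        start, end = orf_coords[name]
        regions = split_transcript(seq, start, end)
        if len(regions.orf) > MIN_ORF_LEN:
            orf_set[name] = regions.orf
        # Assumption: UTRs are kept or dropped independently of the ORF.
        for label, utr in (("5UTR", regions.utr5), ("3UTR", regions.utr3)):
            if len(utr) > MIN_UTR_LEN:
                utr_set[f"{name}_{label}"] = utr
    return orf_set, utr_set
```

In practice the ORF coordinates would be parsed from the TransDecoder output and the sequences from the filtered transcript FASTA file; both parsing steps are omitted here, and ORFs predicted on the reverse strand would require reverse-complementing the transcript first.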