Building up reference ORF and UTR sets
The cDNA probes were sequenced to provide reference sequence sets for
the subsequent analysis of captured data. To this end, 100 ng of the cDNA
probes were used to construct a sequencing library following the same
procedure as genomic library preparation. The probe library was
sequenced on an Illumina HiSeq X Ten sequencer in paired-end 150-bp
mode. Adapter sequences and low-quality nucleotides were first removed
from the raw reads with Trimmomatic version 0.36 (Bolger et al., 2014),
and read quality was assessed with FastQC
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Clean reads
were assembled into transcripts using Trinity r20140717 with default
parameters (Grabherr et al., 2011). The assembled transcripts were
clustered with CD-HIT-EST version 4.6.5 (Fu et al., 2012) at a 95%
similarity cutoff to reduce redundancy. The sequencing depth of each
remaining transcript was calculated with SAMtools version 1.4.1 (Li et
al., 2009). Only transcripts with an average sequencing depth ≥ 5× and
a length ≥ 200 bp were retained. TransDecoder, a program in the Trinity
package, was used to
determine the open reading frame (ORF) of each transcript. Based on the
position of the ORF, each transcript was divided into a 5’ UTR
(untranslated region), a coding region, and a 3’ UTR. The translated
protein sequences of the predicted ORFs were searched with BLASTP (NCBI
BLAST+ version 2.6.0; Boratyn et al., 2013) against the human proteome
with an e-value threshold of 1E-10. Only transcripts with BLASTP hits
were retained, to focus on known vertebrate transcripts.
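The depth and length filter described above can be sketched as follows.
This is a minimal illustration, not the authors' code. It assumes that
per-base depths are available as tab-separated lines of transcript name,
position and depth (the layout written by `samtools depth`), and that
transcript lengths are already known. The function names are
hypothetical.

```python
from collections import defaultdict

MIN_MEAN_DEPTH = 5.0  # average sequencing depth cutoff (>= 5x), from the text
MIN_LENGTH = 200      # transcript length cutoff in bp (>= 200 bp), from the text


def mean_depths(depth_lines, lengths):
    """Return the mean depth per transcript.

    depth_lines: iterable of "name<TAB>position<TAB>depth" strings
                 (assumed input layout).
    lengths:     dict mapping transcript name -> length in bp.
    Positions missing from depth_lines count as zero depth, because the
    sum is divided by the full transcript length.
    """
    totals = defaultdict(int)
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, _position, depth = line.split("\t")
        totals[name] += int(depth)
    return {
        name: totals[name] / length
        for name, length in lengths.items()
        if length > 0
    }


def select_transcripts(depth_lines, lengths):
    """Return sorted names of transcripts passing both cutoffs."""
    means = mean_depths(depth_lines, lengths)
    return sorted(
        name
        for name, length in lengths.items()
        if length >= MIN_LENGTH and means.get(name, 0.0) >= MIN_MEAN_DEPTH
    )
```

Dividing by the full transcript length rather than by the number of
reported positions keeps uncovered bases in the average, which matters
when the depth file omits zero-coverage positions.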
Finally, all ORFs longer than 300 bp and UTRs longer than 100 bp were
extracted using a custom Python script to
build two reference sets (ORF and UTR) for the subsequent captured data
analysis (Fig. 2a).
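
The custom script itself is not given here. The sketch below shows one
way the final extraction step could look. It assumes that each
transcript comes with forward-strand ORF coordinates (0-based,
half-open) and that 5’ and 3’ UTRs are filtered independently of
whether the ORF is kept. Neither assumption is stated in the text, and
the function names and the `_5UTR`/`_3UTR` suffixes are hypothetical.

```python
MIN_ORF_LEN = 300  # ORFs must be strictly longer than 300 bp (from the text)
MIN_UTR_LEN = 100  # UTRs must be strictly longer than 100 bp (from the text)


def split_transcript(seq, orf_start, orf_end):
    """Split a transcript into (5' UTR, ORF, 3' UTR).

    orf_start and orf_end are assumed to be 0-based, half-open
    coordinates on the forward strand of the transcript.
    """
    return seq[:orf_start], seq[orf_start:orf_end], seq[orf_end:]


def build_reference_sets(transcripts):
    """Build the ORF and UTR reference sets.

    transcripts: dict mapping name -> (sequence, orf_start, orf_end).
    Returns two dicts (orf_set, utr_set) mapping names to sequences.
    """
    orf_set, utr_set = {}, {}
    for name, (seq, orf_start, orf_end) in transcripts.items():
        utr5, orf, utr3 = split_transcript(seq, orf_start, orf_end)
        if len(orf) > MIN_ORF_LEN:
            orf_set[name] = orf
        # Assumption: each UTR is judged on its own length only.
        for label, utr in (("5UTR", utr5), ("3UTR", utr3)):
            if len(utr) > MIN_UTR_LEN:
                utr_set[f"{name}_{label}"] = utr
    return orf_set, utr_set
```

For example, a transcript with a 150-bp 5’ UTR, a 303-bp ORF and a
50-bp 3’ UTR would contribute its ORF and its 5’ UTR, but not its 3’
UTR.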