Authorea

Richard Smith-Unna edited Recommendations.tex almost 10 years ago

Commit id: 91b444ed348e0fbbf839ab7051a80c72135b842b

deletions | additions

\section*{Recommendations} \subsection{Input Data}: Data} When planning to construct a transcriptome, the first question to ponder is the type and quantity of data required. While this will be somewhat determined by the specific goals of the study and availability of tissues, there are some general guiding principals. As of 2014, Illumina continues to offer the most flexibility in terms of throughout, analytical tractability, and cost. It is worth noting however, that long-read (e.g. PacBio) transcriptome sequencing is just beginning to emerge as an alternative \citep{Au:2013hp}, particularly for researchers interested in understanding isoform complexity. \\ For the typical transcriptome study, one should plan to generate a reference based on 1 or more tissue types. From each tissue, one should be generating between 50M and 100M strand-specific paired-end reads. Read length should be at least 100bp, with longer reads aiding in isoform reconstruction and contiguity. Because sequence polymorphism increases the complexity of the \textit{de bruijn} graph, and therefore may negatively effect the assembly itself, the reference transcriptome should be generated from reads corresponding to as homogeneous a sample as possible. For non-model organisms, this usually means a single individual. When more then one individual is required to meet other requirements (e.g. number of reads), keeping the number of individuals to a minimum is paramount. \\ \subsection{Quality Control of Sequence Read Data}: Data} Before assembly, it is critical that appropriate quality control steps are implemented. It is often helpful to generate some metrics of read quality on the raw data. Though this step may well be fairly unrepresentative of the true dataset quality, it is often informative and instructive. Several software packages are available -- we are fond of SolexaQA and FastQC. These raw reads should be copied, compressed, and archived. \\ After visualizing the raw data, a vigorous adapter trimming step is implemented, typically using Trimmomatic. With adapter trimming may be a quality trimming step, though caution is required, as aggressive trimming may have detrimental effects on assembly quality. Specifically, we recommend trimming at Phred=2, a threshold associated with removal of the lowest quality bases. After adapter and quality trimming, it is recommended to once again visualize the data using SolexaQC. The .gz compressed reads are now ready for assembly. \\ \subsection{Assembly}: \subsection{Assembly} Assembly of transcriptome data is a pain in the arse... Trinity is great, but is currently constrained to use a single kmer. Trinity, by far the most popular assembler (cite assembly survey?) is an opinionated pipeline with few modifiable parameters - the underlying algorithsm have been pre-optimised to recover large numbers of alternative isoforms. In many cases, Trinity will produce an excellent assembly. However, depending on the genomic makeup of the organism being sequenced, other assemblers may perform better. Other assemblers (e.g. SOAPdenovoTrans, Velvet/Oases and TransAbySS) allow the user to select any value for k, which while increasing the time it takes to optimize assembly, may afford the ability to fine-tune the results, as well as implement a multi-kmer assembly approach. We recommend always assembling with several different strategies and compariing the results. \\ \subsection{Post-assembly transcriptome verification}: verification} In order to compare assembly strategies, and to select a final assembly for downstream analysis, it is important to assess the quality of a transcriptome. Many authors have attempted to use typical genome assembly quality metrics for this purpose. In particular, the N50 summary statistic is often reported (cite). However, in addition to being a poor proxy for quality in genome assembly (cite), N50 in the context of a transcriptome assembly carries very little information because the optimal contig length is not known. Metrics should be chosen that optimise the assembly for the biological question at hand. In most cases, this means maximising the number of transcripts that can be confidently annotated as homologs of known genes in other organisms, while minimising the number of assembly artefacts that might cause problems downstream. Max: reference proteome coverage, min: chimeras, contigs with uncovered bases. Transrate? \subsection{Transcriptome post-processing}: post-processing} discarding crap contigs, scaffolding, merging.