Matthew MacManes edited Recommendations.tex  almost 10 years ago

Commit id: 2981066bb47d57c690d56aa8d15d5e36b1abb97e


Assembling transcriptome data is challenging. Trinity, by far the most popular assembler (cite assembly survey?), is an opinionated pipeline with few modifiable parameters: the underlying algorithms have been pre-optimised to recover large numbers of alternative isoforms, and it is currently constrained to a single k-mer size. In many cases, Trinity will produce an excellent assembly. However, depending on the genomic makeup of the organism being sequenced, other assemblers may perform better. Other assemblers (e.g. SOAPdenovo-Trans, Velvet/Oases and Trans-ABySS) allow the user to select the value of $k$. Although this increases the time needed to optimise an assembly, it allows the results to be fine-tuned and makes a multi-k-mer assembly approach possible. We recommend always assembling with several different strategies and comparing the results.

\subsection{Post-assembly transcriptome verification}

In order to compare assembly strategies, and to select a final assembly for downstream analysis, it is important to assess the quality of a transcriptome. Many authors have attempted to use typical genome assembly quality metrics for this purpose. In particular, the N50 summary statistic is often reported \citep[e.g.][]{Hiz:2014ep,Shinzato:2014hx,Liang:2013fm}. However, in addition to being a poor proxy for quality in genome assembly \citep{Bradnam:2013uua}, N50 carries very little information in the context of a transcriptome assembly, because the optimal contig length is not known. Metrics should instead be chosen to optimise the assembly for the biological question at hand. In most cases, this means maximising the number of transcripts that can be confidently annotated as homologs of known genes in other organisms, while minimising the number of assembly artefacts that might cause problems downstream.

Maximise: reference proteome coverage. Minimise: chimeras, contigs with uncovered bases. Transrate?
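As a concrete illustration of why N50 is a weak quality signal for a transcriptome, the following minimal Python sketch computes N50 for two hypothetical assemblies of the same total length. The function name \texttt{n50} and the contig lengths are assumptions made up for this example; they do not come from any real dataset or from any particular assembler's output.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("need at least one contig of non-zero length")
    running = 0
    for length in ordered:
        running += length
        # Stop at the first contig that takes the running sum to half the total.
        if 2 * running >= total:
            return length


# Hypothetical assembly in which every contig is a correct transcript.
correct = [1000, 800, 600, 400, 200]

# The same bases, but the two longest transcripts have been wrongly
# joined into a single chimeric contig (1000 + 800 = 1800).
chimeric = [1800, 600, 400, 200]

print(n50(correct))   # 800
print(n50(chimeric))  # 1800
```

In this toy case the chimeric assembly reports more than twice the N50 of the correct one, although it is the worse assembly. This is only a sketch of the general concern: without knowing the true transcript lengths, a higher N50 cannot be read as higher quality.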
\subsection{Transcriptome post-processing}

Discarding poor-quality contigs, scaffolding, merging.