\subsection{Input Data}

When planning to construct a transcriptome, the first question to ponder is the type and quantity of data required. While this will be somewhat determined by the specific goals of the study and the availability of tissues, there are some general guiding principles. As of 2014, Illumina continues to offer the most flexibility in terms of throughput, analytical tractability, and cost \citep{Glenn:2011gy}. It is worth noting, however, that long-read (e.g. PacBio) transcriptome sequencing is just beginning to emerge as an alternative \citep{Au:2013hp}, particularly for researchers interested in understanding isoform complexity. \\

For the typical transcriptome study, one should plan to generate a reference based on one or more tissue types, with each tissue adding unique tissue-specific transcripts and isoforms. Because added sequencing coverage yields a more accurate and representative assembly (Figure 1), one should generate between 50M and 100M strand-specific paired-end reads, which appears to represent a good balance between cost and quality. Read length should be at least 100bp, with longer reads aiding in isoform reconstruction and contiguity \citep{Garber:2011gp}. Because sequence polymorphism increases the complexity of the \textit{de Bruijn} graph \citep{Iqbal:2012fx,Paszkiewicz:2010dla}, and therefore may negatively affect the assembly itself, the reference transcriptome should be generated from reads corresponding to as homogeneous a sample as possible. For non-model organisms, this usually means a single individual. When more than one individual is required to meet other requirements (e.g. number of reads), keeping the number of individuals to a minimum is paramount. \\

\subsection{Quality Control of Sequence Read Data}

Before assembly, it is critical that appropriate quality control steps are implemented. It is often helpful to generate some metrics of read quality on the raw data. Though these metrics may be fairly unrepresentative of the final, post-processing dataset quality, they are often informative and instructive. Several software packages are available -- we are fond of SolexaQA and FastQC. These raw reads should be copied, compressed, and archived. \\
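As a toy illustration of the kind of per-read summary such packages produce, the Python sketch below computes the mean Phred quality of each read in a FASTQ file. The file name is hypothetical and Phred+33 quality encoding is assumed; dedicated tools such as FastQC or SolexaQA report far richer diagnostics. \\

\begin{verbatim}
#!/usr/bin/env python
# Toy illustration only: mean per-read Phred quality from a FASTQ file.
# Assumes Phred+33 encoding (standard for recent Illumina data) and the
# usual four-line FASTQ record layout. Input file name is hypothetical.

import sys
from statistics import mean

def read_fastq(path):
    """Yield (header, sequence, quality) tuples from a FASTQ file."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()                 # '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

def mean_quality(qual):
    """Mean Phred score of one read, assuming a Phred+33 ASCII offset."""
    return mean(ord(c) - 33 for c in qual)

if __name__ == "__main__":
    means = [mean_quality(q) for _, _, q in read_fastq(sys.argv[1])]
    print("reads: %d  mean of per-read mean quality: %.2f"
          % (len(means), mean(means)))
\end{verbatim}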


\section*{Methods \& Results}

To demonstrate the merits of our recommendations, a number of assemblies were produced using a variety of methods. These assemblies were evaluated using methods described elsewhere. \\

All assembly datasets were produced by assembling a publicly available 100bp paired-end Illumina dataset (Short Read Archive ID SRR797058, \citealp{Macfarlan:2012js}). This dataset was randomly subsetted into 10, 20, 50, 75, and 100 million read pairs as described in \citet{MacManes:2014io}. Adapters were removed from the reads, as were nucleotides with quality Phred $\leq$ 2, using the program Trimmomatic version 0.32 \citep{Bolger:2014ek}. Read error correction, when employed, was done using Seecer version 0.1.3 \citep{Le:2013dy}. The adapter- and quality-trimmed, error-corrected reads were then assembled using Trinity release r20140717 or SOAPdenovo-Trans version 1.03. Trinity was employed using default settings, while SOAPdenovo-Trans was employed after optimizing kmer size [and those other flags i forget right now]. \\

All assemblies were characterized using Transrate version 0.31.
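The Python sketch below shows one way such read-pair subsets can be drawn; it is illustrative only, and the file names, subset size, and random seed are hypothetical (the subsets above were generated as described in \citet{MacManes:2014io}). It assumes standard four-line FASTQ records with mates in identical order in the two files, and it loads both files into memory, so a streaming approach (or a tool such as seqtk) is preferable at the scale of the datasets used here. \\

\begin{verbatim}
#!/usr/bin/env python
# Sketch of random subsampling of paired-end FASTQ reads. File names,
# subset size, and seed are hypothetical; assumes four-line FASTQ records
# with mates in the same order in both files.

import random

def fastq_records(path):
    """Return a list of four-line FASTQ records (each as one string)."""
    with open(path) as fh:
        lines = fh.readlines()
    return ["".join(lines[i:i + 4]) for i in range(0, len(lines), 4)]

def subsample_pairs(r1_in, r2_in, r1_out, r2_out, n_pairs, seed=42):
    """Write a random sample of n_pairs read pairs, keeping mates in sync."""
    left, right = fastq_records(r1_in), fastq_records(r2_in)
    assert len(left) == len(right), "mate files must have equal read counts"
    keep = sorted(random.Random(seed).sample(range(len(left)), n_pairs))
    with open(r1_out, "w") as o1, open(r2_out, "w") as o2:
        for i in keep:
            o1.write(left[i])
            o2.write(right[i])

# e.g. draw a 10M-pair subset
subsample_pairs("reads_1.fastq", "reads_2.fastq",
                "sub_1.fastq", "sub_2.fastq", n_pairs=10000000)
\end{verbatim}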
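Similarly, the effect of removing 3' bases with quality Phred $\leq$ 2 can be sketched as follows. This mimics a Trimmomatic-style trailing-quality trim rather than reproducing its exact algorithm, and again assumes Phred+33 encoding. \\

\begin{verbatim}
# Sketch of 3'-end quality trimming at Phred <= 2, the threshold used
# above. Mimics the effect of a Trimmomatic-style TRAILING step, not its
# exact algorithm; Phred+33 encoding is assumed.

PHRED_OFFSET = 33

def trim_trailing(seq, qual, min_q=3):
    """Drop 3' bases until one with quality >= min_q is found."""
    end = len(qual)
    while end > 0 and (ord(qual[end - 1]) - PHRED_OFFSET) < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Example: '#' encodes Phred quality 2, so the last three bases are cut.
print(trim_trailing("ACGTACGT", "IIIII###"))   # -> ('ACGTA', 'IIIII')
\end{verbatim}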