Matthew MacManes edited Recommendations.tex almost 10 years ago

Commit id: f06fe600df4b45383d79c2ec6d40f7e99b6654f5

\subsection{Input Data} When planning to construct a transcriptome, the first question to consider is the type and quantity of data required. While this will be determined in part by the specific goals of the study and the availability of tissues, there are some general guiding principles. As of 2014, Illumina continues to offer the most flexibility in terms of throughput, analytical tractability, and cost \citep{Glenn:2011gy}. It is worth noting, however, that long-read (e.g., PacBio) transcriptome sequencing is just beginning to emerge as an alternative \citep{Au:2013hp}, particularly for researchers interested in understanding isoform complexity. \\ For the typical transcriptome study, one should plan to generate a reference based on one or more tissue types, with each tissue adding unique tissue-specific transcripts and isoforms. Because added sequencing coverage yields a more accurate and representative assembly (Figure 1), one should generate between 50M and 100M strand-specific paired-end reads, which appears to represent a good balance between cost and quality. Read length should be at least 100bp, with longer reads aiding in isoform reconstruction and contiguity \citep{Garber:2011gp}. Because sequence polymorphism increases the complexity of the \textit{de Bruijn} graph \citep{Iqbal:2012fx,Paszkiewicz:2010dla}, and may therefore negatively affect the assembly itself, the reference transcriptome should be generated from reads corresponding to as homogeneous a sample as possible. For non-model organisms, this usually means a single individual. When more than one individual is required to meet other requirements (e.g., number of reads), keeping the number of individuals to a minimum is paramount. \\ \subsection{Quality Control of Sequence Read Data} Before assembly, it is critical that appropriate quality control steps are implemented. It is often helpful to generate some metrics of read quality on the raw data. 
Though this step may well be fairly unrepresentative of the true dataset quality, it is often informative and instructive. Several software packages are available -- we are fond of SolexaQA and FastQC. These raw reads should be copied, compressed, and archived. \\
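As a rough illustration of the kind of raw-read metrics meant here, the following Python sketch tallies read count, mean read length, and mean base quality from a FASTQ file. It is not a substitute for SolexaQA or FastQC, and it rests on two assumptions that are not stated in the text: records occupy exactly four lines each, and qualities use the Phred+33 encoding. The function names and the example file name are hypothetical.

```python
import gzip
import io


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz compression."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_fastq(handle):
    """Yield (header, sequence, quality) tuples.

    Assumes strict four-line records (no wrapped sequence lines).
    """
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def summarize(records, offset=33):
    """Return read count, mean read length, and mean per-base Phred score.

    offset=33 assumes Phred+33 encoding; older Illumina data may use 64.
    """
    n_reads = 0
    total_bases = 0
    qual_sum = 0
    for _, seq, qual in records:
        n_reads += 1
        total_bases += len(seq)
        qual_sum += sum(ord(c) - offset for c in qual)
    return {
        "reads": n_reads,
        "mean_length": total_bases / n_reads if n_reads else 0.0,
        "mean_quality": qual_sum / total_bases if total_bases else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical file name, used only for illustration.
    with open_fastq("raw_reads_R1.fastq.gz") as fh:
        print(summarize(parse_fastq(fh)))
```

Running the same summary on the raw and on the processed reads gives a quick before-and-after comparison, although per-position quality profiles from a dedicated tool remain more informative.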