The proper care and feeding of your transcriptome
Richard Smith-Unna\(^{1}\), Matthew D MacManes\(^{3}\)
\(^{1}\) University of Cambridge
\(^{3}\) Department of Molecular, Cellular and Biomedical Sciences, University of New Hampshire, Durham NH, USA

\(\ast\) E-mail: [email protected], @PeroMHC

Abstract

Some abstract

Introduction

For biologists interested in understanding the relationship between fitness, genotype, and phenotype, modern sequencing technologies provide an unprecedented opportunity to gain a deep understanding of the genome-level processes that together underlie adaptation. Transcriptome sequencing has been particularly influential, and as a direct result, a diverse toolset for the assembly and analysis of transcriptomes exists. Notable amongst this wide array are tools for quality visualization (FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and SolexaQA \citep{Cox:2010ch}), read trimming (e.g. Trimmomatic \citep{Bolger:2014ek} and Cutadapt \citep{Martin:2011va}), read normalization (khmer \citep{Pell:2012id}), assembly (Trinity \citep{Haas:2013jq}, SOAPdenovoTrans \citep{Xie:2013wu}), and assembly verification (transrate (https://github.com/Blahah/transrate) and RSEM-eval \citep{Li:2014er}).
The ease with which these tools may be used to produce transcriptome assemblies belies the true complexity underlying the overall process. Indeed, the subtle (and not so subtle) methodological challenges associated with transcriptome reconstruction mean that errors are easily introduced. Amongst the most challenging are isoform reconstruction, the simultaneous assembly of low- and high-coverage transcripts, and [] \citep{Modrek:2001ud,Johnson:2003kh}, which together make the production of an accurate transcriptome assembly genuinely difficult.
Methodological errors are widespread. Particularly common are failures in quality control of input data, a lack of understanding of the role kmer selection plays in accurate reconstruction, and, lastly, the absence of post-assembly quality evaluation. Here, we aim to define a set of evidence-based analyses and methods aimed at improving transcriptome assembly, which in turn has significant effects on all downstream analyses.

To support these standardized methods, we have released a set of version-controlled, open-source code to facilitate the process.

Recommendations

Input Data: When planning to construct a transcriptome, the first question to ponder is the type and quantity of data required. While this will be somewhat determined by the specific goals of the study and the availability of tissues, there are some general guiding principles. As of 2014, Illumina continues to offer the most flexibility in terms of throughput, analytical tractability, and cost. It is worth noting, however, that long-read (e.g. PacBio) transcriptome sequencing is just beginning to emerge as an alternative \citep{Au:2013hp}, particularly for researchers interested in understanding isoform complexity.
For the typical transcriptome study, one should plan to generate a reference based on one or more tissue types. From each tissue, one should generate between 50M and 100M strand-specific paired-end reads. Read length should be at least 100bp, with longer reads aiding in isoform reconstruction and contiguity. Because sequence polymorphism increases the complexity of the de Bruijn graph, and therefore may negatively affect the assembly itself, the reference transcriptome should be generated from reads corresponding to a single individual. When more than one individual is required to meet other requirements (e.g. number of reads), keeping the number of individuals to a minimum is paramount.
Quality Control of Sequence Read Data: Before assembly, it is critical that appropriate quality control steps are implemented. It is often helpful to generate some metrics of read quality on the raw data. Though these metrics may be fairly unrepresentative of the final dataset quality, they are often informative and instructive. Several software packages are available; we are fond of SolexaQA and FastQC. These raw reads should be copied, compressed, and archived.
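The summaries these packages report are built from the Phred scores encoded in FASTQ quality strings. As a minimal sketch of that encoding (not a substitute for FastQC or SolexaQA; it assumes well-formed four-line records with Phred+33 quality encoding):

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 by default)."""
    scores = [ord(c) - offset for c in quality_line]
    return sum(scores) / len(scores)

def fastq_mean_qualities(fastq_text):
    """Yield (read id, mean Phred quality) for each record in a FASTQ string."""
    lines = fastq_text.strip().split("\n")
    for i in range(0, len(lines), 4):
        read_id = lines[i][1:]              # drop the leading '@'
        yield read_id, mean_phred(lines[i + 3])

# Two invented reads: one uniformly high quality, one with poor leading bases.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n"
for rid, q in fastq_mean_qualities(example):
    print(rid, q)
```

Real tools compute per-position distributions across millions of reads; the point here is only how the ASCII quality characters map to Phred scores.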
After visualizing the raw data, a vigorous adapter trimming step should be implemented, typically using Trimmomatic. Adapter trimming may be paired with a quality trimming step, though caution is required, as aggressive trimming may have detrimental effects on assembly quality. Specifically, we recommend trimming at Phred=2, a threshold associated with removal of only the lowest-quality bases. After adapter and quality trimming, it is recommended to once again visualize the data using SolexaQA. The .gz compressed reads are now ready for assembly.
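To make the Phred=2 recommendation concrete, the sketch below trims low-quality bases from both ends of a read, analogous in spirit to Trimmomatic's LEADING and TRAILING steps (an illustrative toy, not Trimmomatic's implementation):

```python
def end_trim(seq, qual, threshold=2, offset=33):
    """Remove bases scoring below `threshold` from both ends of a read.

    `seq` is the base string, `qual` the matching Phred+33 quality string.
    At threshold=2, only the very lowest-quality bases are removed.
    """
    scores = [ord(c) - offset for c in qual]
    start = 0
    while start < len(scores) and scores[start] < threshold:
        start += 1
    end = len(scores)
    while end > start and scores[end - 1] < threshold:
        end -= 1
    return seq[start:end], qual[start:end]

# '!' encodes Phred 0, 'I' encodes Phred 40: only the flanking bases go.
print(end_trim("NACGTN", "!IIII!"))
```

Raising the threshold trims more aggressively, which is exactly the behaviour we caution against above.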
Assembly: Assembly of transcriptome data is a ... Trinity performs well, but is currently constrained to use a single kmer. In contrast, other assemblers (e.g. SOAPdenovoTrans) allow the user to select any value for k, which, while increasing the time required to optimize the assembly, may afford the ability to fine-tune the results, as well as to implement a multi-kmer assembly approach.
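Why the choice of k matters can be seen on a toy example: a single polymorphism creates a bubble in the de Bruijn graph, and the number of graph nodes (kmers) the variant disrupts grows with k (the two sequences below are invented, for illustration only):

```python
def kmers(seq, k):
    """Set of k-length substrings of seq (the nodes of a de Bruijn graph)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

ref = "ATCGGATTACAGCTTGACCA"
alt = "ATCGGATTACTGCTTGACCA"   # differs from ref by a single SNP

for k in (4, 6, 8):
    # kmers present in one sequence but not the other: the "bubble" size
    divergent = len(kmers(ref, k) ^ kmers(alt, k))
    print(k, divergent)
```

Larger k thus separates polymorphic or repetitive sequence more cleanly but demands more coverage to traverse each node, which is the trade-off a multi-kmer approach tries to balance.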
Post-assembly transcriptome verification: De-emphasize length-based statistics such as N50; focus instead on functional, evidence-based metrics such as those produced by transrate and RSEM-eval.
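A toy calculation shows why N50 alone is a poor guide: fusing two complete transcripts into one chimeric contig makes the assembly worse, yet raises its N50 (contig lengths invented for illustration; evidence-based tools instead score the assembly against the reads):

```python
def n50(lengths):
    """Smallest contig length such that contigs at least that long
    account for half or more of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

correct = [1000, 800, 600, 400]   # four accurately assembled transcripts
chimeric = [1800, 600, 400]       # the two longest fused into one chimera

print(n50(correct), n50(chimeric))
```

The chimeric assembly contains exactly the same bases but misrepresents the transcripts, and N50 rewards it anyway.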

Acknowledgments

Figures

Tables