The proper care and feeding of your transcriptome

Richard Smith-Unna\(^{1}\), Matthew D MacManes \(^{2}\),
\(^{1}\) Department of Plant Sciences, University of Cambridge, Cambridge, United Kingdom
\(^{2}\) Department of Molecular, Cellular and Biomedical Sciences, University of New Hampshire, Durham NH, USA

\(\ast\) E-mail:, @PeroMHC

AbstractThere will be an abstract


For biologists interested in understanding the relationship between fitness, genotype, and phenotype, modern sequencing technologies provide for an unprecedented opportunity to gain a deep understanding of genome level processes that together, underlie adaptation. Transcriptome sequencing has been particularly influential, and has resulted in discoveries not possible even just a few years ago. This in large part is due to the scale at which these studies may be conducted. Unlike studies of adaptation based on one or a small number of candidate genes (e.g. (Fitzpatrick et al., 2005; Panhuis, 2006)), transctiptome studies may assay the entire suite of expressed transcripts – the transcriptome – simultaneously. In addition to issues of scale, newer sequencing studies have much more power to detect lowly expressed transcripts, or small differences in gene expression as a result of enhanced dynamic range (Wolf, 2013; Vijay et al., 2013).

As a direct result of their widespread popularity, a diverse toolset for the assembly and analysis of transcriptome exists. Notable amongst the wide array of tools include several for quality visualization (FastQC ( and SolexaQA (Cox et al., 2010)) read trimming (e.g. Trimmomatic (Bolger et al., 2014) and Cutadapt (Martin, 2011)), read normalization (khmer (Pell et al., 2012)) and error correction (Le et al., 2013), assembly (Trinity (Haas et al., 2013), SOAPdenovoTrans (Xie et al., 2014)) and assembly verificaton (transrate and RSEM-eval (Li et al., 2014)). The ease with which these tools may be used to to produce transcriptome assemblies belies that true complexity underlying the overall process. Indeed, the subtle (and not so subtle) methodological challenges associated with transcriptome reconstruction means that you can easily fuck it up. Amongst the most challenging include isoform reconstruction, simultaneous assembly of low- and high-coverage transcripts, and [] (Modrek et al., 2001; Johnson et al., 2003), which together make good transcriptome assembly really difficult. As in child rearing, production of a respectable transcriptome sequence requires a large investment in time and resources. At every step in development, care must be taken correct, but not overcorrect. Here, we propose a set of guidelines for the care and feeding that will result in the production of a well-adjusted transcriptome.
In particular, we focus here our efforts on the early development of the transcriptome, which, unfortunately are often neglected or abused. Particularly flagrant are abuses related to quality control of input data, the lack of understanding the role kmer selection may play in accurate reconstruction, and lastly later in development, abuses related to the lack of post-assembly quality evaluation. Here, we aim to define a set of evidence based analyses and methods aimed at improving transcriptome assembly, which in turn has significant effects on all downstream analyses. To accomplish the proposed standardized methods, we have released a set of version controlled open-sourced code to facilitate this process.

Methods & Results

To demonstrate the merits of our recommendations, a number of assemblies were produced using a variety of methods. Speciifically, all assembly datasets were produced by asembling a publically available 100bp paired-end Illumina dataset (Short Read Archive ID SRR797058, (Macfarlan et al., 2012)). This dataset was subsetted randomly into 10, 20, 50, 75, and 100 million read pairs as described in (MacManes, 2014). Reads were error corrected using the software packare bless version 0.16 (Heo et al., 2014) and a kmer=19, which was selected based on the developers recommendation. Illumina sequencing adapters were removed from both ends of the sequencing reads, as were nucleotides with quality Phred \(\leq\) 2, using the program Trimmmatic version 0.32 (Bolger et al., 2014). The adapter and quality trimmed, error corrected reads were then assembled using Trinity release r20140717 or SOADdenovo-Trans version 1.03. Trinity was employed using default settings, while SOAPdenovo-Trans was employed after optimizing kmer size, [and those other flags i forget right now].
To demonstrate the efficacy of using multiple assemblers, in this case Trinity and SOAPdenovo-Trans, we merged assemblies via the following process. [Richard fill in details].

For the assembly generated for the illustration of the shortcomings of length based evaluation, we generated an assembly using Trinity that employed settings purposely designed to increase the length of contigs while sacficing accuracy (–path_reinforcement_distance 1 –min_per_id_same_path 80 –max_diffs_same_path 5 –min_glue 1).
Assemblies were characterized using Transrate version 1.0.0.beta1. Using this software, we generated three kinds of metrics: contig metrics; mapping metrics which used as input the same reads that were fed into the assembler for each assembly; and comparative metrics which used as input the Mus musculus version 75 ’all’ protein file downloaded from Ensembl for all assemblies.