The proper care and feeding of your transcriptome

Richard Smith-Unna\(^{1}\), Matthew D MacManes\(^{2}\)
\(^{1}\) Department of Plant Sciences, University of Cambridge, Cambridge, United Kingdom
\(^{2}\) Department of Molecular, Cellular and Biomedical Sciences, University of New Hampshire, Durham NH, USA

\(\ast\) E-mail: @PeroMHC


There will be an abstract


For biologists interested in understanding the relationships between fitness, genotype, and phenotype, modern sequencing technologies provide an unprecedented opportunity to gain a deep understanding of the genome-level processes that, together, underlie adaptation. Transcriptome sequencing has been particularly influential, and has resulted in discoveries not possible even just a few years ago. This is in large part due to the scale at which these studies may be conducted. Unlike studies of adaptation based on one or a small number of candidate genes (e.g. (Fitzpatrick et al., 2005; Panhuis, 2006)), transcriptome studies may assay the entire suite of expressed transcripts – the transcriptome – simultaneously. In addition to issues of scale, newer sequencing studies have much more power to detect lowly expressed transcripts, or small differences in gene expression, as a result of enhanced dynamic range (Wolf, 2013; Vijay et al., 2013).

As a direct result of their widespread popularity, a diverse toolset for the assembly and analysis of transcriptomes exists. Notable amongst this wide array are tools for quality visualization (FastQC and SolexaQA (Cox et al., 2010)), read trimming (e.g. Trimmomatic (Bolger et al., 2014) and Cutadapt (Martin, 2011)), read normalization (khmer (Pell et al., 2012)) and error correction (Le et al., 2013), assembly (Trinity (Haas et al., 2013), SOAPdenovo-Trans (Xie et al., 2014)), and assembly verification (Transrate and RSEM-eval (Li et al., 2014)). The ease with which these tools may be used to produce transcriptome assemblies belies the true complexity underlying the overall process. Indeed, the subtle (and not so subtle) methodological challenges associated with transcriptome reconstruction mean that the process is easy to get badly wrong. Amongst the most challenging issues are isoform reconstruction and the simultaneous assembly of low- and high-coverage transcripts, both complicated by widespread alternative splicing (Modrek et al., 2001; Johnson et al., 2003), which together make high-quality transcriptome assembly genuinely difficult. As in child rearing, production of a respectable transcriptome sequence requires a large investment of time and resources. At every step in development, care must be taken to correct, but not overcorrect. Here, we propose a set of guidelines for care and feeding that will result in the production of a well-adjusted transcriptome.
In particular, we focus our efforts here on the early development of the transcriptome, which, unfortunately, is often neglected or abused. Particularly flagrant are abuses related to quality control of the input data, a lack of understanding of the role kmer selection plays in accurate reconstruction, and, later in development, the absence of post-assembly quality evaluation. Here, we aim to define a set of evidence-based analyses and methods aimed at improving transcriptome assembly, which in turn has significant effects on all downstream analyses. To facilitate adoption of the proposed standardized methods, we have released a set of version-controlled, open-source code.

Methods & Results

To demonstrate the merits of our recommendations, a number of assemblies were produced using a variety of methods. Specifically, all assemblies were produced from a publicly available 100bp paired-end Illumina dataset (Short Read Archive ID SRR797058, (Macfarlan et al., 2012)). This dataset was randomly subsetted into 10, 20, 50, 75, and 100 million read pairs as described in (MacManes, 2014). Reads were error corrected using the software package bless version 0.16 (Heo et al., 2014) with kmer=19, which was selected based on the developers' recommendation. Illumina sequencing adapters were removed from both ends of the sequencing reads, as were nucleotides with quality Phred \(\leq\) 2, using the program Trimmomatic version 0.32 (Bolger et al., 2014). The adapter- and quality-trimmed, error-corrected reads were then assembled using Trinity release r20140717 or SOAPdenovo-Trans version 1.03. Trinity was employed using default settings, while SOAPdenovo-Trans was employed after optimizing the kmer size [additional optimized parameters to be specified].
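
For reference, the commands below sketch how Trinity and SOAPdenovo-Trans are typically invoked on data of this kind; the file names, thread and memory settings, and the SOAPdenovo-Trans library configuration values are placeholders reflecting typical usage of these versions rather than the exact parameters used here (the error-correction and trimming commands are sketched in the quality-control section below).

    # Trinity (default settings); inputs are the trimmed, error-corrected paired reads
    Trinity --seqType fq --JM 50G --CPU 16 \
      --left reads.corrected.P.1.fastq --right reads.corrected.P.2.fastq \
      --output trinity_default

    # SOAPdenovo-Trans requires a library configuration file, e.g.:
    #   max_rd_len=100
    #   [LIB]
    #   avg_ins=200
    #   reverse_seq=0
    #   asm_flags=3
    #   q1=reads.corrected.P.1.fastq
    #   q2=reads.corrected.P.2.fastq
    # and is then run across a range of kmer (-K) values to find the optimum:
    SOAPdenovo-Trans-31mer all -s library.config -K 31 -o soap_k31
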
To demonstrate the efficacy of using multiple assemblers, in this case Trinity and SOAPdenovo-Trans, we merged the resulting assemblies via the following process: [details of the merging procedure to be added].

To illustrate the shortcomings of length-based evaluation, we generated an additional assembly using Trinity with settings purposely chosen to increase the length of contigs while sacrificing accuracy (--path_reinforcement_distance 1 --min_per_id_same_path 80 --max_diffs_same_path 5 --min_glue 1).
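
A sketch of this invocation is shown below; aside from the accuracy-sacrificing flags listed above, the input file names and resource settings are placeholders.

    # Trinity run deliberately tuned toward longer, less accurate contigs
    Trinity --seqType fq --JM 50G --CPU 16 \
      --left reads.corrected.P.1.fastq --right reads.corrected.P.2.fastq \
      --path_reinforcement_distance 1 --min_per_id_same_path 80 \
      --max_diffs_same_path 5 --min_glue 1 \
      --output trinity_long_contigs
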
Assemblies were characterized using Transrate version 1.0.0.beta1. Using this software, we generated three kinds of metrics: contig metrics; mapping metrics, which used as input the same reads that were supplied to the assembler for each assembly; and comparative metrics, which used as input the Mus musculus version 75 'all' protein file downloaded from Ensembl, for all assemblies.
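
A typical Transrate invocation combining all three classes of metrics is sketched below; the assembly and read file names are placeholders, and the reference file name follows the usual Ensembl release-75 naming convention.

    # contig, read-mapping, and comparative metrics in a single Transrate run
    transrate --assembly trinity_default.fasta \
      --left reads.corrected.P.1.fastq --right reads.corrected.P.2.fastq \
      --reference Mus_musculus.GRCm38.75.pep.all.fa \
      --threads 8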


Input Data

Summary Statement: Sequence one or more tissues from a single individual to a depth of between 50 million and 100 million 100bp paired-end reads.

When planning to construct a transcriptome, the first question to ponder is the type and quantity of data required. While this will be somewhat determined by the specific goals of the study and the availability of tissues, there are some general guiding principles. As of 2014, Illumina continues to offer the most flexibility in terms of throughput, analytical tractability, and cost (Glenn, 2011). It is worth noting, however, that long-read (e.g. PacBio) transcriptome sequencing is just beginning to emerge as an alternative (Au et al., 2013), particularly for researchers interested in understanding isoform complexity.
For the typical transcriptome study, one should plan to generate a reference based on one or more tissue types, with each tissue adding unique tissue-specific transcripts and isoforms. Because added sequencing coverage yields a more accurate and representative assembly (Figure 1), one should generate between 50M and 100M strand-specific paired-end reads, which appears to represent a good balance between cost and quality. Read length should be at least 100bp, with longer reads aiding in isoform reconstruction and contiguity (Garber et al., 2011). Because sequence polymorphism increases the complexity of the de Bruijn graph (Iqbal et al., 2012; Studholme, 2010), and may therefore negatively affect the assembly itself, the reference transcriptome should be generated from reads corresponding to as homogeneous a sample as possible. For non-model organisms, this usually means a single individual. When more than one individual is required to meet other requirements (e.g. number of reads), keeping the number of individuals to a minimum is paramount.

Quality Control of Sequence Read Data

Summary Statement: Visualize your read data, error correct reads, remove adapters, and employ gentle quality filtering.

Before assembly, it is critical that appropriate quality-control steps are implemented. It is often helpful to generate some metrics of read quality on the raw data. Though these metrics may be fairly unrepresentative of the quality of the final dataset, they are often informative and instructive. Several software packages are available; we are fond of SolexaQA and FastQC. The raw reads should be copied, compressed, and archived.
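
Both packages are run directly on the fastq files, along the lines of the sketch below; the file names are placeholders, and the SolexaQA command shown assumes the C++ (SolexaQA++) release rather than the older Perl scripts.

    # per-base quality reports for each raw read file
    mkdir -p raw_qc
    fastqc --outdir raw_qc reads.raw.1.fastq reads.raw.2.fastq
    SolexaQA++ analysis reads.raw.1.fastq reads.raw.2.fastq
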
After visualizing the raw data, error correction of the sequencing reads should be performed (MacManes et al., 2013) using any of the available error correctors, though we have had success with both bless and BayesHAMMER (Nikolenko et al., 2013). The error-corrected reads are then subjected to a vigorous adapter-trimming step, typically using Trimmomatic. Adapter trimming may be accompanied by a quality-trimming step, though caution is required, as aggressive trimming may have detrimental effects on assembly quality. Specifically, we recommend trimming at Phred=2 (MacManes, 2014), a threshold associated with removal of only the lowest quality bases. After adapter and quality trimming, it is recommended to once again visualize the data using SolexaQA. The .gz compressed reads are now ready for assembly.
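
The commands below sketch this error-correction and gentle-trimming step; the file names and adapter file are placeholders, and the bless option names follow its version 0.16 documentation as we recall it.

    # error correction with bless (kmer=19, per the developers' recommendation)
    bless -read1 reads.raw.1.fastq -read2 reads.raw.2.fastq \
      -kmerlength 19 -prefix reads.corrected

    # adapter removal plus gentle quality trimming at Phred=2 with Trimmomatic
    java -jar trimmomatic-0.32.jar PE -phred33 \
      reads.corrected.1.fastq reads.corrected.2.fastq \
      reads.corrected.P.1.fastq reads.corrected.U.1.fastq \
      reads.corrected.P.2.fastq reads.corrected.U.2.fastq \
      ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:2 TRAILING:2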

Coverage Normalisation

Summary Statement: Normalize your data only if you have to.

Depending on the volume of input data, the availability of a high-memory workstation, and the rapidity with which the assembly is needed, coverage normalization may be employed. This process, which estimates the coverage of each read from the abundance of its constituent kmers and discards reads whose median kmer coverage exceeds a threshold, aims to erode areas of high coverage while leaving untouched reads spanning lower-coverage regions, thus reducing mean read coverage to a user-specified level (typically 30-50X). The primary purpose of normalization is to reduce the memory and runtime requirements of assembly.
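
A minimal normalization sketch using khmer's digital normalization is shown below, assuming the trimmed, error-corrected reads from the previous steps and a target coverage of 30; the file names and the kmer and coverage settings are placeholders to be tuned to the dataset, and depending on the khmer version the hash-table memory options may also need to be set.

    # interleave the paired reads, then normalize to a median kmer coverage of 30
    interleave-reads.py reads.corrected.P.1.fastq reads.corrected.P.2.fastq \
      -o reads.interleaved.fastq
    normalize-by-median.py --paired -k 20 -C 30 \
      -o reads.normalized.fastq reads.interleaved.fastq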