Introduction

For biologists interested in the relationship between fitness, genotype, and phenotype, modern sequencing technologies provide an unprecedented opportunity to understand the genome-level processes that together underlie adaptation. Transcriptome sequencing has been particularly influential, and has resulted in discoveries that were not possible even a few years ago. This is due in large part to the scale at which these studies may be conducted. Unlike studies of adaptation based on one or a small number of candidate genes \citep[e.g.,][]{Fitzpatrick:2005vd,Panhuis:2006kp}, transcriptome studies may assay the entire suite of expressed transcripts – the transcriptome – simultaneously. In addition to this advantage of scale, newer sequencing studies have an enhanced dynamic range, and therefore much more power to detect lowly expressed transcripts or small differences in gene expression \citep{Wolf:2013hd,Vijay:2012gy}.

As a direct result of the widespread popularity of transcriptome studies, a diverse toolset for the assembly and analysis of transcriptomes exists. Notable among this wide array are tools for quality visualization (FastQC, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, and SolexaQA \citep{Cox:2010ch}), read trimming (e.g. Trimmomatic \citep{Bolger:2014ek} and Cutadapt \citep{Martin:2011va}), read normalization (khmer \citep{Pell:2012id}) and error correction \citep{Le:2013dy}, assembly (Trinity \citep{Haas:2013jq}, SOAPdenovoTrans \citep{Xie:2013wu}), and assembly verification (transrate, https://github.com/Blahah/transrate, and RSEM-eval \citep{Li:2014er}). The ease with which these tools may be used to produce transcriptome assemblies belies the true complexity underlying the overall process. Indeed, the subtle (and not so subtle) methodological challenges associated with transcriptome reconstruction mean that errors are easily introduced. Among the most challenging are isoform reconstruction, simultaneous assembly of low- and high-coverage transcripts, and [] \citep{Modrek:2001ud,Johnson:2003kh}, which together make accurate transcriptome assembly difficult. As in child rearing, production of a respectable transcriptome sequence requires a large investment of time and resources. At every step in development, care must be taken to correct, but not overcorrect. Here, we propose a set of guidelines for the care and feeding that will result in the production of a well-adjusted transcriptome.
In particular, we focus our efforts on the early development of the transcriptome, which, unfortunately, is often neglected or abused. Particularly flagrant are abuses related to quality control of the input data and a lack of understanding of the role that k-mer selection may play in accurate reconstruction; later in development, a further abuse is the lack of post-assembly quality evaluation. We aim to define a set of evidence-based analyses and methods for improving transcriptome assembly, which in turn has significant effects on all downstream analyses. To facilitate adoption of these standardized methods, we have released a set of version-controlled, open-source code.