Authorea

Jenna M. Lang edited Library Preparation and Sequencing .md over 9 years ago

Commit id: 99ec58076ab8461e3f442fe42368ab9316d935c0


#Library Preparation and Sequencing

##Library Preparation

The first choice in library preparation is whether to do the library prep yourself or to have the library made by your sequencing provider. The economics of this decision usually depend on the number of samples involved. For example, an Illumina TruSeq library prep kit costs around $2600 for 48 samples (roughly $54 per sample). That's far cheaper than the $150 to $300 that a typical sequencing provider would charge per sample. However, if you're only preparing a couple of samples there's no reason to buy an entire kit. The requisite time and ancillary consumables and equipment must also be taken into account (see Figure \ref{fig:cost}). Most sequencing facilities offer library preparation services.

##Kit Options

Whether you choose to make libraries yourself or use a service provider, the next major choice is the type of kit. The two most popular choices with Illumina kits are the Nextera transposase-based kits and the TruSeq kits (with or without PCR). These kits are available from Illumina, but there are also comparable options from other vendors (_e.g._ New England Biolabs and Kapa Biosystems). The pros and cons of each type of kit are listed below:

+ Nextera: _Pro_ – It allows for very low amounts of input DNA, down to 1 ng in the case of the Nextera XT kit. _Con_ – The transposase has an insertion bias, and the extensive PCR required for low-input samples will also impact the final assembly \cite{Aird_2011}.
+ TruSeq (our recommendation): _Pro_ – The PCR-free protocol minimizes library bias by using mechanical instead of enzymatic DNA fragmentation, and the elimination of PCR results in better assemblies. _Con_ – It requires a large amount of DNA (at least 1 µg for PCR-free). There is also now a TruSeq LT kit, which only requires 100 ng of DNA and a reduced number of PCR cycles.
This may provide a middle option between PCR-free TruSeq and Nextera.

When growing bacteria in culture as described in this workflow, it should almost always be possible to get enough DNA to use PCR-free TruSeq and therefore minimize library preparation biases in the genome assembly.

##Considerations in Library Preparation

Insert size: The tradeoff with insert size is between utility for assembly (larger is better) and the ability of those fragments to amplify on the Illumina flowcell for sequencing (smaller is better). The optimal fragment size also depends on the length of reads used (with longer read lengths, longer insert sizes are useful for scaffolding). The final consideration is the amount of DNA available for sequencing. While having all inserts be exactly 750 base pairs (bp) might be ideal, such a stringent size selection could result in the recovery of only a very small amount of DNA. In our lab, with paired-end 300 bp (PE300) reads on the Illumina MiSeq, we target a fragment size (including adapters) of 600-900 bp. Different sequencing facilities have different opinions on this topic, and it is worth having a discussion with your sequencing facility's point of contact before making any libraries. It is very important that all samples have similar library sizes if multiplexing as described below.

##Multiplexing

The capacity of an Illumina MiSeq with PE300 reads is around 15 gigabases (Gb), which would result in a coverage of about 4300X for a typical bacterium with a 3.5 Mb genome. On the HiSeq with PE125 reads, this would be over 14,000X coverage. Currently, the recommended coverage for a bacterial genome assembly is 20-200X, depending on the choice of assembler. Therefore, sequencing a single bacterial genome on a full MiSeq or HiSeq run is a significant waste of money and reagents.
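
The coverage figures quoted above come from dividing the run capacity by the genome size. The following is a minimal Python sketch of that arithmetic, using the approximate MiSeq capacity and genome size given in the text. The function names, the assumption that pooled genomes are equal in size and share the run evenly, and the 100X target are illustrative only and are not part of the workflow.

```python
def coverage_per_genome(run_capacity_bp, genome_size_bp, n_genomes=1):
    """Average coverage per genome when n_genomes of equal size share a run evenly."""
    return run_capacity_bp / (genome_size_bp * n_genomes)


def max_genomes_per_run(run_capacity_bp, genome_size_bp, target_coverage):
    """Largest number of equal-sized genomes that each still reach target_coverage."""
    return run_capacity_bp // (genome_size_bp * target_coverage)


MISEQ_PE300_CAPACITY_BP = 15_000_000_000  # ~15 Gb, as quoted in the text
GENOME_SIZE_BP = 3_500_000                # typical 3.5 Mb bacterial genome

single = coverage_per_genome(MISEQ_PE300_CAPACITY_BP, GENOME_SIZE_BP)
pooled = coverage_per_genome(MISEQ_PE300_CAPACITY_BP, GENOME_SIZE_BP, n_genomes=48)
at_100x = max_genomes_per_run(MISEQ_PE300_CAPACITY_BP, GENOME_SIZE_BP, 100)

print(f"one genome per run: {single:.0f}X")
print(f"48 genomes per run: {pooled:.0f}X")
print(f"genomes per run at 100X each: {at_100x}")
```

Under these idealized assumptions, a single genome lands near the 4300X figure, while 48 pooled genomes each fall inside the 20-200X range. Real runs yield less than the nominal capacity and pooling is never perfectly even, so treat the results as upper bounds.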

Furthermore, some current genome assembly algorithms do not perform well given an excess of data, and require down-sampling (_i.e._, throwing away data) to achieve the recommended coverage for assembly. We typically multiplex 10-48 genomes on a PE300 MiSeq run and many more on a HiSeq run. If using a kit for library prep, multiplexing is quite straightforward since a number of barcoded adapters come with the kit. Demultiplexing can be performed by the sequencing facility.

##Collaborate

As described above, current Illumina sequencing systems have much greater capacity than is needed for sequencing a single genome. This means it is generally beneficial to combine many samples into a single run of a machine. Unfortunately, our experience has been that sequencing facilities will typically not help in the coordination of such pooling of samples (we assume because they do not want to oversee the pooling or deal with the associated accounting hassles). Therefore, it is typically up to the users to carry out such coordination. Though this can sometimes be complicated, it is generally worthwhile, since one can pool together many genomes or metagenomes into a single run of a system and still get enough data for each project, thus making the sequencing cost per project significantly lower. For this to work well, one needs to coordinate the use of barcodes to tag each sample, coordinate the pooling, and have available the informatics required to "demultiplex" samples from each other.

##Downsampling

Coverage (also known as read depth) is the average number of reads representing a given nucleotide. It is a function of the number and size of genomes pooled onto a run and the number and length of reads. The optimal amount of coverage depends on the read length, the assembler being used, and other factors.
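
For a single genome, the relationship described above can be approximated as coverage = (number of reads × read length) / genome size. The Python sketch below computes coverage from a read count and inverts the same relation to estimate how many reads correspond to a target coverage. The function names and the example values (a 3.5 Mb genome, 300 bp reads, a 100X target) are assumptions chosen for illustration, not outputs of the workflow.

```python
import math


def coverage(n_reads, read_length_bp, genome_size_bp):
    """Average coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_for_coverage(target_coverage, read_length_bp, genome_size_bp):
    """Number of individual reads needed to reach target_coverage, rounded up."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Illustrative values only: a 3.5 Mb genome sequenced with 300 bp reads.
genome_size = 3_500_000
read_length = 300

n_keep = reads_for_coverage(100, read_length, genome_size)
print("reads for 100X:", n_keep)
print("coverage check:", coverage(n_keep, read_length, genome_size))
```

With these example values, roughly 1.17 million reads give 100X coverage. The sketch counts individual reads; whether a given subsampling tool expects a count of reads or of read pairs should be checked against that tool.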

For Illumina data assembled using this workflow, we recommend that coverage be between 20X and 200X. See our more detailed discussion in section 9.1.3 "Interpretation of A5-miseq stats". If you have coverage significantly higher than 200X and wish to downsample your data, we have written a script (sub\_sample\_reads) for this purpose. Downsampling should not be necessary if following the assembly instructions in this workflow.

If downsampling, you will first need to calculate how many reads you want the script to sample. We recommend determining how many reads would be equivalent to 100X coverage (divide the genome size by the average read length and multiply by 100). You can download the zipped script file from Figshare \cite{9a5f8181-40cb-45b4-8f8c-d2abfe9c8cff}. Create a new directory containing the script (sub\_sample\_reads) and the reads you wish to downsample. To downsample the data, navigate to the directory you just created (in the terminal) and use the following command: `./subsample_reads.pl file1 file2 #_reads_to_keep output_file_name`

For example: