Authorea

Madison edited Library Preparation and Sequencing .md over 9 years ago

Commit id: 3ed75423fc345cc3aba018323554f19825384e4e


Insert size: The tradeoff with insert size is between utility for assembly (larger is better) and the ability of those fragments to amplify on the Illumina flowcell for sequencing (smaller is better). The optimal fragment size also depends on the read length used (with longer reads, longer inserts are useful for scaffolding). The final consideration is the amount of DNA available for sequencing. While having all inserts be exactly 750 base pairs (bp) might be ideal, such a stringent size selection could result in the recovery of only a very small amount of DNA. In our lab, with paired-end 300bp (PE300) reads on the Illumina MiSeq, we aim for a fragment size (including adapters) of 600-900bp. Different sequencing facilities have different opinions on this topic, and it is worth having a discussion with your sequencing facility's point of contact before making any libraries.

##Multiplexing

The capacity of an Illumina MiSeq with PE300 reads is around 15 Gigabases (Gb), which would result in a coverage of roughly 4300X for a typical bacterium with a 3.5Mb genome. On the HiSeq with PE125 reads, this would be over 14,000X coverage. Currently, the recommended coverage for a bacterial genome assembly is 20-200X, depending on the choice of assembler. Therefore, sequencing a single bacterial genome on a full MiSeq or HiSeq run is a significant waste of money and reagents. Furthermore, some current genome assembly algorithms do not perform well given an excess of data and require down-sampling (i.e., throwing away data) to achieve the recommended coverage for assembly.

We typically multiplex 10-48 genomes on a PE300 MiSeq run and many more on a HiSeq run. If using a kit for library prep, multiplexing is quite straightforward, since a number of barcoded adapters come with the kit. Demultiplexing can be performed by the sequencing facility.
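The coverage figures quoted for multiplexing follow from simple arithmetic: run yield divided by the total length of the genomes sharing the run. The Python sketch below reproduces that calculation. The function name `expected_coverage` is hypothetical, and the sketch assumes that the full run yield is usable, that it is split evenly among samples, and that all genomes are the same size. None of these holds exactly in practice.

```python
def expected_coverage(run_yield_bp, genome_size_bp, n_genomes=1):
    """Average coverage per genome, assuming the run's yield is split
    evenly among n_genomes genomes of equal size (a simplification)."""
    if run_yield_bp < 0 or genome_size_bp <= 0 or n_genomes < 1:
        raise ValueError("yield must be >= 0; genome size and count must be positive")
    return run_yield_bp / (genome_size_bp * n_genomes)


# Figures taken from the text: ~15 Gb per PE300 MiSeq run, 3.5 Mb genome.
MISEQ_PE300_YIELD_BP = 15e9
GENOME_SIZE_BP = 3.5e6

if __name__ == "__main__":
    # 1 genome per run, then the two ends of the 10-48 multiplexing range.
    for n in (1, 10, 48):
        cov = expected_coverage(MISEQ_PE300_YIELD_BP, GENOME_SIZE_BP, n)
        print(f"{n:>2} genome(s) per run: about {cov:.0f}X each")
```

Under these assumptions, a single 3.5Mb genome receives roughly 4300X, ten genomes receive about 430X each, and 48 genomes receive about 89X each, which falls inside the recommended 20-200X range.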
##Collaborate

Given the overcapacity of Illumina sequencing for bacterial genomes, sequencing a single genome presents a problem (unless you are willing to pay the ~$2000 total cost and throw away most of the data). Sequencing facilities will typically not "pool" samples from multiple groups because they don't want to oversee the pooling or deal with the associated accounting hassles. However, collaborating with other groups can be a great option. Many labs sequence genomes or metagenomes on a regular basis; adding one additional sample isn't technically very difficult, but it will entail oversight of the pooling and the associated accounting hassles. It will also entail a discussion of barcode compatibility, to ensure that all barcodes are sufficiently distinct for demultiplexing.

##Downsampling

Coverage (read depth) is the average number of reads representing a given nucleotide and is a function of the number and size of genomes pooled onto a run. The optimal amount of coverage depends on the read length, the assembler being used, and other factors. For Illumina data assembled using this workflow, we recommend that this number be between 20x and 200x. See our more detailed discussion in section 9.1.3 "Interpretation of A5-miseq stats".

If you have coverage significantly higher than 200x and wish to downsample your data, we have written a script (sub\_sample\_reads) for this purpose. You will first need to calculate how many reads you want the script to sample. We recommend determining how many reads would be equivalent to 100x coverage (divide the genome size by the average read length and multiply by 100). You can download the script from the figshare zipped script file (http://figshare.com/articles/Miscellaneous_Scripts_for_Workflow/1086285). Create a new directory containing the script (sub\_sample\_reads) and the reads you wish to downsample.
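The read-count calculation described above (genome size divided by average read length, multiplied by the target coverage) can be sketched in Python as follows. This is only an illustration of the arithmetic, not part of the sub\_sample\_reads script: the function name `reads_for_coverage` is hypothetical, and whether the script expects a count of individual reads or of read pairs is not stated here, so check its usage before passing in a number.

```python
import math


def reads_for_coverage(genome_size_bp, mean_read_length_bp, target_coverage=100):
    """Number of individual reads needed to reach target_coverage.

    reads = genome size / mean read length * target coverage,
    rounded up to a whole read.
    """
    if genome_size_bp <= 0 or mean_read_length_bp <= 0 or target_coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(genome_size_bp * target_coverage / mean_read_length_bp)


if __name__ == "__main__":
    # Example values from the text: 3.5 Mb genome, 300 bp reads, 100x target.
    n_reads = reads_for_coverage(3_500_000, 300, 100)
    print(f"Reads needed for 100x: {n_reads}")
    # Assumption: if the subsampler counts pairs, halve the number (rounding up).
    print(f"Equivalent read pairs: {math.ceil(n_reads / 2)}")
```

For a 3.5Mb genome and 300bp reads, this gives about 1.17 million reads for 100x coverage. Reads trimmed during quality control will be shorter than the nominal length, so using the measured average read length rather than 300 gives a more accurate target.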