Authorea

Jonathan A. Eisen edited Library Preparation and Sequencing .md over 9 years ago

Commit id: f5fe816bfabb00d9a783bcf9854d05ea55ca2d1e

The capacity of an Illumina MiSeq with PE300 reads is around 15 gigabases (Gb), which would result in a coverage of about 4300X for a typical bacterium with a 3.5 Mb genome. On the HiSeq with PE125 reads, this would be over 14,000X coverage. Currently, the recommended coverage for a bacterial genome assembly is 20-200X, depending on the choice of assembler. Therefore, sequencing a single bacterial genome on a full MiSeq or HiSeq run is a significant waste of money and reagents. Furthermore, some current genome assembly algorithms do not perform well given an excess of data and require down-sampling (i.e., throwing away data) to achieve the recommended coverage for assembly. We typically multiplex 10-48 genomes on a PE300 MiSeq run and many more on a HiSeq run. If using a kit for library prep, multiplexing is quite straightforward since a number of barcoded adaptors come with the kit. Demultiplexing can be performed by the sequencing facility.

##Collaborate

Current Illumina sequencing systems have much greater capacity than is needed for sequencing a single genome, so it is generally beneficial to combine many samples into a single run of a machine. Unfortunately, our experience has been that sequencing facilities will typically not help coordinate such pooling of samples from multiple groups (we assume because they do not want to deal with the associated accounting hassles). It is therefore up to the users to carry out this coordination. Though this can sometimes be complicated, it is generally worth it: one can pool many genomes into a single run of a system and still get enough data for each project, making the sequencing cost per project significantly lower. For this to work well, one needs to coordinate the use of barcodes to tag each sample, coordinate the pooling, and do some informatics work at the end to "demultiplex" the samples from each other.

##Downsampling

Coverage (read depth) is the average number of reads representing a given nucleotide and is a function of the number and size of genomes pooled onto a run. The optimal amount of coverage depends on the read length, the assembler being used, and other factors. For Illumina data assembled using this workflow we recommend that this number be between 20x and 200x. See our more detailed discussion in section 9.1.3 "Interpretation of A5-miseq stats".

If you have coverage significantly higher than 200x and wish to downsample your data, we have written a script (sub\_sample\_reads) for this purpose. You will first need to calculate how many reads you want the script to sample. We recommend determining how many reads would be equivalent to 100x coverage (divide the genome size by the average read length and multiply by 100).

You can download the script from the figshare zipped script file (http://figshare.com/articles/Miscellaneous_Scripts_for_Workflow/1086285). Create a new directory containing the script (sub\_sample\_reads) and the reads you wish to downsample.
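
The coverage arithmetic used in this section can be sketched in a few lines of Python. This is only an illustration, not part of the workflow scripts: the function names `pooled_coverage` and `reads_for_coverage` are hypothetical, the sketch assumes that pooled samples receive equal shares of the run, and it counts individual reads (for paired-end data, halve the result to get the number of read pairs). It does not call or describe the interface of sub\_sample\_reads.

```python
import math


def pooled_coverage(run_bases, genome_size, n_samples=1):
    """Approximate per-genome coverage when n_samples genomes of equal
    size share one run equally (an idealized assumption)."""
    return run_bases / (genome_size * n_samples)


def reads_for_coverage(genome_size, read_length, target_coverage=100):
    """Number of individual reads needed to reach target_coverage:
    genome size divided by average read length, times the target."""
    return math.ceil(genome_size * target_coverage / read_length)


# Figures taken from the text: a ~15 Gb MiSeq PE300 run, a 3.5 Mb genome.
MISEQ_BASES = 15_000_000_000
GENOME_SIZE = 3_500_000

print(round(pooled_coverage(MISEQ_BASES, GENOME_SIZE)))      # 4286
print(round(pooled_coverage(MISEQ_BASES, GENOME_SIZE, 48)))  # 89
print(reads_for_coverage(GENOME_SIZE, 300))                  # 1166667
```

Under these assumptions, 48 genomes of this size on one MiSeq run would each still fall within the recommended 20-200X range, and roughly 1.17 million 300 bp reads correspond to 100x coverage of a 3.5 Mb genome.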