Authorea

David Coil edited Library Preparation and Sequencing .md over 9 years ago

Commit id: c9b254b1be6bab9719e502fca4846c8dcb80f74b

The capacity of an Illumina MiSeq with PE300 reads is around 15 gigabases (Gb), which would result in a coverage of roughly 4300X for a typical bacterium with a 3.5 Mb genome. On the HiSeq with PE125 reads, this would be over 14,000X coverage. Currently, the recommended coverage for a bacterial genome assembly is 20-200X, depending on the choice of assembler. Sequencing a single bacterial genome on a full MiSeq or HiSeq run is therefore a significant waste of money and reagents. Furthermore, some current genome assembly algorithms do not perform well given an excess of data and require downsampling (i.e., throwing away data) to achieve the recommended coverage for assembly. We typically multiplex 10-48 genomes on a PE300 MiSeq run and many more on a HiSeq run. If using a kit for library prep, multiplexing is quite straightforward, since a number of barcoded adaptors come with the kit. Demultiplexing can be performed by the sequencing facility.

##Collaborate

As described above, current Illumina sequencing systems have much greater capacity than is needed for sequencing a single genome, so it is generally beneficial to combine many samples into a single run. Unfortunately, in our experience sequencing facilities will typically not help coordinate such pooling (we assume because they do not want to oversee the pooling or deal with the associated accounting hassles), so it is up to the users to carry out the coordination. Though this can sometimes be complicated, it is generally worth it: one can pool many genomes or metagenomes into a single run and still get enough data for each project, making the sequencing cost per project significantly lower.
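
When planning a pooled run, it can help to estimate how much coverage each genome would receive. The following is a minimal sketch of that arithmetic, not part of the workflow's scripts: the function name `per_sample_coverage` is hypothetical, the 15 Gb yield and 3.5 Mb genome size are the figures quoted above, and the calculation assumes reads are divided evenly among equally sized genomes, which real pools only approximate.

```python
def per_sample_coverage(run_yield_bp, genome_size_bp, n_samples=1):
    """Expected average coverage per genome on a shared run.

    Assumes (for illustration only) that every pooled genome has the
    same size and receives an equal share of the run's output.
    """
    if run_yield_bp <= 0 or genome_size_bp <= 0 or n_samples <= 0:
        raise ValueError("yield, genome size and sample count must be positive")
    return run_yield_bp / (genome_size_bp * n_samples)


# Figures quoted in the text: ~15 Gb per PE300 MiSeq run, 3.5 Mb genome.
MISEQ_PE300_YIELD_BP = 15e9
GENOME_SIZE_BP = 3.5e6

for n in (1, 10, 48):
    cov = per_sample_coverage(MISEQ_PE300_YIELD_BP, GENOME_SIZE_BP, n)
    print(f"{n:>2} genome(s) per run: about {cov:.0f}X each")
```

Under these assumptions a single genome lands far above the 20-200X range, while a pool of 48 genomes falls inside it. Uneven pooling or differing genome sizes will shift individual samples away from the average.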
For pooling to work well, one needs to coordinate the use of barcodes to tag each sample, coordinate the pooling itself, and do some informatics work at the end to "demultiplex" the samples from each other.

##Downsampling

Coverage (read depth) is the average number of reads representing a given nucleotide and is a function of the number and size of genomes pooled onto a run. The optimal amount of coverage depends on the read length, the assembler being used, and other factors. For Illumina data assembled using this workflow, we recommend coverage between 20X and 200X. See our more detailed discussion in section 9.1.3 "Interpretation of A5-miseq stats".

If you have coverage significantly higher than 200X and wish to downsample your data, we have written a script (sub\_sample\_reads) for this purpose. You will first need to calculate how many reads you want the script to sample. We recommend determining how many reads would be equivalent to 100X coverage (divide the genome size by the average read length and multiply by 100). You can download the script from the figshare zipped script file \cite{9a5f8181-40cb-45b4-8f8c-d2abfe9c8cff}. Create a new directory containing the script (sub\_sample\_reads) and the reads you wish to downsample.
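
The read-count calculation described above can be sketched as follows. This is an illustrative helper only, not part of sub\_sample\_reads: the function name `reads_for_coverage` is hypothetical, and the example genome size (3.5 Mb) and read length (300 bp) are assumed values. The result counts individual reads; whether the script expects reads or read pairs is not stated here, so check before passing the number along.

```python
import math


def reads_for_coverage(genome_size_bp, mean_read_length_bp, target_coverage=100):
    """Number of reads giving roughly `target_coverage` over the genome.

    Implements: genome size / average read length * target coverage,
    rounded up to a whole read.
    """
    if genome_size_bp <= 0 or mean_read_length_bp <= 0 or target_coverage <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(genome_size_bp * target_coverage / mean_read_length_bp)


# Assumed example values: a 3.5 Mb genome and 300 bp average read length.
n_reads = reads_for_coverage(3_500_000, 300, target_coverage=100)
print(f"Reads to sample for ~100X coverage: {n_reads}")
```

If reads have been quality trimmed, use the average length after trimming, since shorter reads mean more reads are needed to reach the same coverage.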