Authorea

David Coil edited Genome Sequencing.md about 9 years ago

Commit id: f6f36679238f5ccfa1f9e1c691c407cb0f58983a

In our lab, with paired-end 300bp (PE300) reads on the Illumina MiSeq, we target a DNA fragment size (including adapters) of 600-900bp. The high end of the range is constrained by the maximum length of a DNA molecule that can be amplified on the Illumina MiSeq. The low end of the range is defined by the smallest fragment size that will not produce overlapping reads. Ideally, you would sequence only at the high end of the range, because longer insert sizes aid genome assembly. However, the range is typically expanded to ensure that enough DNA is available for sequencing. Different sequencing facilities have different opinions on this topic, and it is worth discussing it with your sequencing facility's point of contact before making any libraries. If you are multiplexing as described below, it is very important that all samples have similar insert sizes.

##Multiplexing

Coverage (also known as read depth) is the average number of reads representing a given nucleotide. It is a function of the number and size of genomes pooled onto a run and the number and length of reads. The optimal amount of coverage depends on the read length, the assembler being used, and other factors. The capacity of an Illumina MiSeq with PE300 reads is around 15 Gigabases (Gb), which would result in a coverage of roughly 4300x for a typical bacterium with a 3.5 Mb genome. On the HiSeq with PE125 reads, the same genome would receive over 14,000x coverage. Currently, the recommended coverage for a bacterial genome assembly is 20-200x, depending on the choice of assembler. Therefore, sequencing a single bacterial genome on a full MiSeq or HiSeq run is a significant waste of money and reagents. Furthermore, some current genome assembly algorithms do not perform well given an excess of data and require down-sampling (_i.e._, throwing away data, Section 8.6) to achieve the recommended coverage for assembly.

We typically multiplex 10-48 genomes on a PE300 MiSeq run and many more on a HiSeq run.
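
The coverage arithmetic above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are illustrative, and it assumes that the run's output is split evenly across genomes of equal size, which real pooled runs only approximate.

```python
def coverage(run_output_bp, genome_size_bp, n_genomes=1):
    """Average coverage per genome, assuming the run output is split evenly."""
    return run_output_bp / (genome_size_bp * n_genomes)


def max_genomes(run_output_bp, genome_size_bp, target_coverage):
    """Largest number of equal-size genomes that still reach the target coverage."""
    return int(run_output_bp // (genome_size_bp * target_coverage))


# Figures taken from the text: ~15 Gb per PE300 MiSeq run, 3.5 Mb genome.
MISEQ_PE300_BP = 15e9
GENOME_BP = 3.5e6

print(round(coverage(MISEQ_PE300_BP, GENOME_BP)))                # 4286 (the "4300x" above)
print(round(coverage(MISEQ_PE300_BP, GENOME_BP, n_genomes=48)))  # 89
print(max_genomes(MISEQ_PE300_BP, GENOME_BP, 200))               # 21
```

Under these assumptions, even 48 genomes of this size on one MiSeq run would each land inside the 20-200x range quoted above. Uneven pooling and reads lost to quality filtering will lower the real figure for some samples, so it is sensible to leave some margin.
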
If using a kit for library prep, multiplexing is quite straightforward, since a number of barcoded adapters come with the kit. We recommend having the sequencing facility demultiplex the samples, as this only requires a list of the barcodes used.

##Collaborate

As described above, current Illumina sequencing systems have much greater capacity than is needed for sequencing a single genome. This means it is generally beneficial to combine many samples into a single run of a machine. Unfortunately, our experience has been that sequencing facilities will typically not help coordinate such pooling of samples (we assume because they do not want to oversee the pooling or deal with the associated accounting hassles). Therefore, it is typically up to the users to carry out such coordination. Though this can sometimes be complicated, it is generally worthwhile: one can pool many genomes or metagenomes into a single run and still get enough data for each project, making the sequencing cost per project significantly lower. For this to work well, one needs to coordinate the use of barcodes to tag each sample, coordinate the pooling, and have available the informatics required to "demultiplex" samples from each other.
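
Part of that coordination can be checked before anything is pooled. The sketch below is one possible approach, not a standard tool: it flags barcodes that more than one sample is planning to use and splits the run cost by sample count. The function names, project names, barcode sequences, and run cost are all made up for illustration, and the cost split assumes every sample receives an equal share of the run.

```python
from collections import defaultdict


def find_barcode_collisions(sample_sheet):
    """Return {barcode: [(project, sample), ...]} for barcodes used more than once.

    sample_sheet is a list of (project, sample, barcode) tuples.
    """
    users = defaultdict(list)
    for project, sample, barcode in sample_sheet:
        users[barcode.upper()].append((project, sample))
    return {bc: who for bc, who in users.items() if len(who) > 1}


def split_cost(sample_sheet, run_cost):
    """Split the run cost across projects in proportion to their sample counts."""
    counts = defaultdict(int)
    for project, _sample, _barcode in sample_sheet:
        counts[project] += 1
    total = sum(counts.values())
    return {project: run_cost * n / total for project, n in counts.items()}


# Hypothetical pooled run: three samples from one lab, one from another.
sheet = [
    ("lab_a", "isolate_1", "ATCACG"),
    ("lab_a", "isolate_2", "CGATGT"),
    ("lab_a", "isolate_3", "TTAGGC"),
    ("lab_b", "isolate_9", "atcacg"),  # same barcode as lab_a's isolate_1
]

print(find_barcode_collisions(sheet))
print(split_cost(sheet, run_cost=1000.0))
```

Here the first call reports that `ATCACG` is claimed by both labs, so one of them must switch barcodes before pooling. Samples that share a barcode cannot be separated after sequencing.
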