Authorea

David Coil edited Genome Assembly and Annotation.md almost 10 years ago

Commit id: a5b0c5514ef76ca38cb227c8311dc0871c0518da

There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (discussed below). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Nevertheless, we can get an idea of how complete the genome is by looking for highly conserved "housekeeping" genes that are found in almost every bacterial genome. To do this, we use a program called Phylosift (\cite{Darling_2014}) to assess the presence or absence of 37 housekeeping genes in the assembly.

###Interpretation of A5 stats

The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely to result from short read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies using Illumina 300 bp paired-end data, assembled with A5, tend to range from 10 to 200 contigs. It is also worth noting that, unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with A5 (Coil et al., submitted).

"Genome Size" and "Longest Scaffold" are simply reported in base pairs. While genome size can vary within taxa, it serves as a second sanity check for the assembly. When expecting a 5 Mb genome, an assembled genome size of 2 Mb should raise a red flag.

"N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs, is the most common measure of assembly quality; larger is better.
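To make the N50 definition concrete, here is a minimal sketch that computes the contig count, total size, longest contig, and N50 from a list of contig lengths. It is not how A5 computes its stats; the function name `assembly_stats`, the dictionary keys, and the example lengths are all hypothetical and chosen only for illustration.

```python
def assembly_stats(contig_lengths):
    """Summarize an assembly from a list of contig (or scaffold) lengths in bp."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    # Walk the contigs from longest to shortest, accumulating their lengths.
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = lengths[-1]
    for length in lengths:
        running += length
        # N50 is the first length at which the running total
        # reaches at least half of the whole assembly.
        if 2 * running >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest_bp": lengths[0],
        "n50_bp": n50,
    }


# Hypothetical contig lengths (bp), for illustration only.
example = [400_000, 300_000, 150_000, 100_000, 50_000]
stats = assembly_stats(example)
print(stats)
```

In this made-up example the two longest contigs together hold 700,000 of the 1,000,000 bp, so the N50 is the length of the second contig. The same function can be fed scaffold lengths instead of contig lengths if a scaffold-level N50 is wanted.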
An N50 of 5,000 bp would be pretty poor, meaning that roughly half of the entire assembly is in contigs of 5,000 bp or smaller. On the other hand, an N50 of 1,000,000 bp would be great for a bacterial genome.

The raw read and nucleotide counts ("Raw reads"/"Raw nt") and the error-corrected counts ("EC Reads"/"EC nt") are useful for seeing what percentage of the data has been discarded. A very large difference between these numbers (the "Pct" stats) would indicate either poor quality input data or significant adaptor contamination. Adaptor contamination is high when the insert size is too small. **(Other causes? Also, do you want to include a little troubleshooting here?)**
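The "Pct" comparison between raw and error-corrected counts is simple arithmetic, sketched below. The function name `percent_retained` and the read counts are hypothetical, and no cutoff for a "very large difference" is implied here, since that judgment depends on the data set.

```python
def percent_retained(raw, corrected):
    """Percentage of raw reads (or nucleotides) remaining after error correction."""
    if raw <= 0:
        raise ValueError("raw count must be positive")
    return 100.0 * corrected / raw


# Hypothetical counts, for illustration only.
raw_reads, ec_reads = 2_000_000, 1_900_000
pct_reads = percent_retained(raw_reads, ec_reads)
print(f"{pct_reads:.1f}% of reads passed error correction")
```

The same calculation applies to the nucleotide counts; a low percentage for either one points back to the input data quality or adaptor contamination discussed above.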