Aaron Darling edited Genome Assembly and Annotation.md  almost 10 years ago

Commit id: b2735031489adb445cf7d2de7b90f7fa63c76ff4

deletions | additions      

       

###Interpretation of A5-miseq stats  To open A5-miseq stats, import it into excel as a tab deliminated CSV file.  The first two numbers, shown in columns 2 and 3, are the number of contigs and scaffolds. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig with no unresolved nucleotides but that is extremely unlikely to result from short read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies using Ilumina PE300bp PE300  data, assembled with A5, tend to range from 10-200 contigs. It is also worth noting that unless studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly which is typically >99% using A5-miseq (Coil et al, submitted). "Genome Size" and "Longest Scaffold" are simply represented as base pairs. While genome size can vary within taxa, this can be a second useful sanity check for the assembly. When expecting a 5MB genome based on other sequenced isolates from the same genus, if the assembled genome size is 2MB or 10MB, a red flag should be raised. "N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs is the most common measure of assembly quality… larger is better. An N50 of 5,000 bp would be quite poor... meaning that half of the entire assembly is in contigs smaller than 5,000 bp. On the other hand an N50 of 1,000,000 bp is considered very good for bacterial genomes sequenced with Illumina technology.