Jenna M. Lang edited Genome Assembly and Annotation.md
almost 10 years ago
Commit id: 07c2664c501e1159ed6481e8f4765b05cbf1c2c5
diff --git a/Genome Assembly and Annotation.md b/Genome Assembly and Annotation.md
index 5fb980c..2003551 100644
--- a/Genome Assembly and Annotation.md
+++ b/Genome Assembly and Annotation.md
...
For more on interpreting these numbers proceed to "Assembly Validation".
###Assembly Validation
There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (e.g., number of contigs and contig N50; discussed below). The second is verification that the organism sequenced is the organism of interest, simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available.
Nevertheless, we can get an idea of how complete the genome is by looking for highly conserved "housekeeping" genes that are found in almost every bacterial genome. To do this, we use a program called PhyloSift (REF) to assess the presence or absence of 37 highly conserved, single-copy bacterial housekeeping genes in the assembly as a rough proxy for completeness.
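The completeness estimate described above comes down to a simple fraction: how many of the expected marker genes were found in the assembly. Here is a minimal sketch of that arithmetic, assuming the marker hits have already been extracted from the PhyloSift output by some other means. The function name `marker_completeness` and the `marker_NN` identifiers are hypothetical placeholders, not PhyloSift's own names or API.

```python
def marker_completeness(found_markers, expected_markers):
    """Return (fraction of expected markers found, sorted list of missing markers)."""
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected_markers must not be empty")
    # Only count hits that belong to the expected marker set.
    present = expected & set(found_markers)
    missing = sorted(expected - present)
    return len(present) / len(expected), missing


# Hypothetical identifiers standing in for the 37 housekeeping genes.
expected = [f"marker_{i:02d}" for i in range(1, 38)]
# Pretend the assembly contained all but the last two markers.
found = expected[:35]

fraction, missing = marker_completeness(found, expected)
print(f"{fraction:.1%} of markers found; missing: {missing}")
```

A low fraction suggests an incomplete assembly, while a few missing markers are best treated as a prompt for a closer look rather than a verdict.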
###Interpretation of A5 stats
The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely with short-read data. At the other extreme, a bacterial assembly in 1000 contigs would be very fragmented. In our experience, acceptable bacterial assemblies using Illumina PE300bp data, assembled with A5, tend to range from 10-200 contigs. It is also worth noting that, unless studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (Coil et al, submitted).
"Genome Size" and "Longest Scaffold" are simply represented as base-pairs. While genome size can vary within taxa, this can be a second sanity check for the assembly. When expecting a 5MB genome, finding only 2MB in the assembly would be problematic. "N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs is the most common measure of assembly quality… larger is better.