Guillaume Jospin edited Genome Assembly and Annotation.md
almost 10 years ago
Commit id: ced8c1019a898f00a8d01a4022bee1b78e513b7b
diff --git a/Genome Assembly and Annotation.md b/Genome Assembly and Annotation.md
index faa7800..a5be1d5 100644
--- a/Genome Assembly and Annotation.md
+++ b/Genome Assembly and Annotation.md
...
For more on interpreting these numbers proceed to "Assembly Validation".
###Assembly Validation
There are three components to genome assembly validation. The first is the overall "quality" of the assembly, assessed by examining the stats provided by A5 (discussed below). The second is verification that the organism sequenced is the organism of interest, done simply by checking the assembled 16S sequence with BLAST. The third is "completeness", which is difficult to measure except in cases where a close reference is available. Nevertheless, we can get an idea of how complete the genome is by looking for highly conserved "housekeeping" genes that are found in almost every bacterial genome. To do this, we use a program called Phylosift (\cite{Darling_2014}) to check for the presence or absence of 37 housekeeping genes in the assembly and infer completeness from the result.
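
As a rough illustration of the completeness idea, the sketch below computes the fraction of expected housekeeping markers that were detected. It does not parse Phylosift output; the marker names (`marker_01` to `marker_37`) and the `completeness` function are placeholders invented for this example, and the list of detected markers is assumed to have been collected by hand from the Phylosift results.

```python
def completeness(found_markers, expected_markers):
    """Return (fraction of expected markers present, sorted list of missing markers)."""
    expected = set(expected_markers)
    present = expected & set(found_markers)
    missing = sorted(expected - present)
    return len(present) / len(expected), missing


# Placeholder names standing in for the 37 housekeeping markers (hypothetical).
expected = [f"marker_{i:02d}" for i in range(1, 38)]

# Pretend that 35 of the 37 markers were detected in the assembly.
found = expected[:35]

fraction, missing = completeness(found, expected)
print(f"{len(expected) - len(missing)}/{len(expected)} markers found ({fraction:.1%})")
print("Missing:", ", ".join(missing))
```

A missing marker does not prove that the gene is absent from the organism, only that it was not recovered in this assembly, so treat the fraction as an estimate.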
###Interpretation of A5 stats
The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely to result from short-read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies using Illumina PE300bp data, assembled with A5, tend to range from 10 to 200 contigs. It is also worth noting that, unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% using A5 (Coil et al, submitted).
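
The rules of thumb above can be sketched as a small check on a contig FASTA file. This is only an illustration: the function names are made up for this example, the contig count is taken by counting FASTA header lines rather than from the A5 stats output, and the categories simply restate the ranges given in the text (counts that fall between those ranges are reported as such, not judged).

```python
import io


def count_contigs(fasta_handle):
    """Count sequences in a FASTA file by counting header lines (those starting with '>')."""
    return sum(1 for line in fasta_handle if line.startswith(">"))


def describe_contig_count(n):
    """Map a contig count onto the rough categories described in the text."""
    if n == 1:
        return "single contig (what a finished assembly would look like)"
    if 10 <= n <= 200:
        return "within the 10 to 200 range typical of acceptable assemblies"
    if n >= 1000:
        return "very fragmented"
    return "outside the ranges discussed in the text"


# Tiny in-memory FASTA standing in for a real contigs file.
example = io.StringIO(">contig_1\nACGTACGT\n>contig_2\nGGCCTTAA\n")
n = count_contigs(example)
print(n, "contigs:", describe_contig_count(n))
```

For a real assembly, open the contigs file with `open(path)` and pass the handle to `count_contigs` in place of the `io.StringIO` object.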