Jenna M. Lang edited Genome Assembly and Annotation.md
almost 10 years ago
Commit id: 5470c03f4fb872c95e25b5695d8b075c28827557
diff --git a/Genome Assembly and Annotation.md b/Genome Assembly and Annotation.md
index 5174b76..0345f17 100644
--- a/Genome Assembly and Annotation.md
+++ b/Genome Assembly and Annotation.md
...
###Interpretation of A5 stats
The first two numbers shown are the number of contigs and scaffolds, respectively. Defining a "good" or "bad" assembly starts here. A finished assembly would consist of a single contig, but that is extremely unlikely to result from short-read data. At the other extreme, we would consider a bacterial assembly in 1000 contigs to be very fragmented. In our experience, acceptable bacterial assemblies using Illumina PE300bp data, assembled with A5, tend to range from 10 to 200 contigs. It is also worth noting that unless you are studying genomic organization, the number of contigs is less important than the gene content recovered by the assembly, which is typically >99% with this method (Coil et al., submitted).
"Genome Size" and "Longest Scaffold" are simply reported in base pairs. While genome size can vary within taxa, this can be a second sanity check for the assembly. When expecting a 5 Mb genome, if the assembled genome size is only 2 Mb, a red flag should be raised. "N50" represents the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs, is the most common measure of assembly quality; larger is better.
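The N50 definition and the genome-size sanity check above can be sketched in a few lines of Python. This is only an illustration, not how A5 computes its stats: the function names, the example contig lengths, and the 20% tolerance are all assumptions made for the example.

```python
def n50(lengths):
    """Return the N50 of a list of contig (or scaffold) lengths in bp."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until at least half of the
    # assembly is contained in the contigs seen so far.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def size_looks_plausible(lengths, expected_bp, tolerance=0.2):
    """Hypothetical check: is the assembled size within +/- tolerance
    (an assumed 20% by default) of the expected genome size?"""
    return abs(sum(lengths) - expected_bp) <= tolerance * expected_bp


# Made-up contig lengths, for illustration only.
contig_lengths = [400_000, 300_000, 200_000, 100_000]
print("N50:", n50(contig_lengths))
print("Plausible for a 5 Mb genome:", size_looks_plausible(contig_lengths, 5_000_000))
```

With these made-up lengths, half of the 1 Mb total is reached at the second-longest contig, so the N50 is 300,000 bp, and the size check fails for an expected 5 Mb genome.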
The raw read and nucleotide counts ("Raw reads"/"Raw nt") and the error-corrected read and nucleotide counts ("EC Reads"/"EC nt") are useful for seeing what percentage of the data has been discarded. A very large difference between these numbers (the "Pct" stats) would indicate either poor quality input data or significant adaptor contamination. Adaptor contamination is high when the insert size is too small **(other causes?) Also, do you want to include a little troubleshooting here?**
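As a rough illustration of how the "Pct" stats relate to the raw and error-corrected counts, the sketch below computes the percentage of data that survives error correction. The 80% warning threshold and the example counts are assumptions chosen for the example; the text above does not specify a cutoff.

```python
def pct_passing(raw, corrected):
    """Percentage of raw reads (or nucleotides) kept after error correction."""
    if raw <= 0:
        raise ValueError("raw count must be positive")
    return 100.0 * corrected / raw


def flag_heavy_loss(raw, corrected, min_pct=80.0):
    """Return True when less than min_pct of the data survives.

    min_pct=80.0 is an assumed threshold for illustration only.
    """
    return pct_passing(raw, corrected) < min_pct


# Made-up counts, for illustration only.
raw_reads, ec_reads = 2_000_000, 1_200_000
print(f"{pct_passing(raw_reads, ec_reads):.1f}% of reads passed error correction")
print("Investigate input quality or adaptors:", flag_heavy_loss(raw_reads, ec_reads))
```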
Finally, "X\_cov" shows the average coverage across the genome. For Illumina data we recommend that this number be between ~30X and 100X. With much less than 30X coverage, the quality of any given base in the assembly may come into question. Conversely, too much coverage can reduce the quality of the assembly and may require downsampling.
**Instructions or reference for downsampling?**
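One possible way to approach downsampling, offered as a sketch and not as the authors' procedure: estimate the average coverage as total nucleotides divided by genome size, work out what fraction of reads would bring it down to a target, and keep read pairs at random with that probability. The function names, the 100X target, and the in-memory list of read pairs are assumptions for illustration; real FASTQ files would need a streaming parser.

```python
import random


def mean_coverage(total_nt, genome_size_bp):
    """Average coverage: sequenced nucleotides divided by genome size."""
    return total_nt / genome_size_bp


def downsample_fraction(coverage, target=100.0):
    """Fraction of reads to keep so coverage is roughly `target` (assumed 100X)."""
    if coverage <= target:
        return 1.0  # already at or below the target, keep everything
    return target / coverage


def subsample_pairs(pairs, fraction, seed=42):
    """Keep each read pair with probability `fraction`.

    Forward and reverse reads are kept or dropped together so the
    pairing is preserved. `pairs` is a hypothetical list of (r1, r2) tuples.
    """
    rng = random.Random(seed)
    return [pair for pair in pairs if rng.random() < fraction]


# Made-up numbers: 1.5 Gb of sequence for an expected 5 Mb genome.
cov = mean_coverage(1_500_000_000, 5_000_000)
frac = downsample_fraction(cov)
print(f"coverage ~{cov:.0f}X, keep ~{frac:.2f} of read pairs")
```

Sampling pairs with a fixed seed keeps the result reproducible, which helps when comparing assemblies made from the same subsample.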
###Verification of 16S Sequence