David Coil edited Genome Assembly and Annotation.md  almost 10 years ago

Commit id: a40cc4a6887815c086c4c75da99596608ca2c753


"Genome Size" and "Longest Scaffold" are simply reported in base pairs. While genome size can vary within taxa, this can serve as a second sanity check for the assembly. When expecting a 5 Mb genome, an assembled genome size of 2 Mb should raise a red flag. "N50" is the contig size at which at least 50% of the assembly is contained in contigs of that size or larger. This metric, combined with the number of contigs, is the most common measure of assembly quality; larger is better. An N50 of 5,000 bp would be pretty poor, meaning that roughly half of the entire assembly is in contigs of 5,000 bp or smaller. On the other hand, an N50 of 1,000,000 bp would be great for a bacterial genome.

The raw read and nucleotide counts ("Raw reads"/"Raw nt") and the error-corrected counts ("EC Reads"/"EC nt") are useful for seeing what percentage of the data has been discarded. A very large difference between these numbers (that is, a low "% reads passing EC" or "% nt passing EC") would indicate either poor-quality input data or significant adapter contamination. Adapter contamination can be high when the insert size is too small or if there were problems during library preparation.

Finally, "X\_cov" shows the average coverage across the genome. AARON DESCRIBE THE COVERAGE STATS HERE. For Illumina data we recommend that this number be between ~30X and 100X. With much less than 30X coverage, the quality of any given base in the assembly may come into question. Conversely, too much coverage can reduce the quality of the assembly and require downsampling. **Instructions or reference for downsampling?**

###Verification of 16S Sequence

Follow the steps described in Section ??, "Making a Phylogenetic Tree", for obtaining the full-length 16S sequence and performing a BLAST search with it.
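
Returning briefly to the assembly statistics: the N50 definition given earlier can be checked by hand from a list of contig lengths. The following is a minimal sketch of that definition, not the code used by any particular assembler or stats script; the function name `n50` and the example contig lengths are hypothetical and chosen only for illustration.

```python
def n50(contig_lengths):
    """Return the N50 (in bp) of a list of contig lengths.

    Illustrative helper only: it implements the textbook definition,
    i.e. the contig size at which at least 50% of the assembly is
    contained in contigs of that size or larger.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest, accumulating length.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # The first contig that brings the running total to at least
        # half of the assembly defines the N50.
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp), not taken from a real assembly.
lengths = [1_000_000, 500_000, 300_000, 150_000, 50_000]
print(n50(lengths))
```

With these made-up lengths the assembly totals 2,000,000 bp, and the single longest contig already covers half of it, so the N50 equals the length of that contig.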