Authorea

Madison edited Genome Assembly and Annotation.md almost 10 years ago

Commit id: fabc6d54daa9f6d3b70e0c7e5460666b0a776174


A5-miseq reports three depth of coverage statistics that can be used to assess whether sufficient data has been collected for genome assembly. The first is "Raw cov", which is simply the total number of base pairs of sequence data divided by the assembly size. This gives an estimate of the average number of reads covering each base in the assembly. The actual number of reads at each site can and will vary substantially from the average. The second statistic is "Median cov", which gives the median depth of coverage among all sites in the assembly. That is, 50% of sites will have greater coverage than this value and 50% will have less. The third, "10th percentile cov", indicates a coverage level below which only 10% of sites in the assembly fall.

For Illumina data, the ideal median coverage lies between ~20X and 100X. At median coverage much below 20X, the quality of individual base calls may be compromised. Ideally, the 10th percentile coverage will be higher than 10X, for similar reasons.

A separate metric of base call quality is also reported by A5-miseq as "bases >= Q40". Following assembly, A5-miseq realigns the reads to the assembled sequence and estimates the accuracy of the nucleotide called at each site in the assembly. These accuracies are provided as PHRED quality scores \cite{green2009phrap}, which express the estimated probability of an incorrect base call on a logarithmic scale. For example, a PHRED score of 20 (Q20) indicates a 99% chance that the correct base was called, while Q30 and Q40 indicate 99.9% and 99.99% probabilities, respectively. A5-miseq reports the number of assembly bases called with a quality of at least Q40.

###Verification of 16S Sequence

Follow the steps described in Section 11, "Making a Phylogenetic Tree", for obtaining the full-length 16S sequence and performing a BLAST search with it.
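
As an aside on the assembly statistics discussed earlier, the following minimal Python sketch shows how the three coverage values and the PHRED-to-accuracy conversion can be computed by hand. It is not A5-miseq's own code: the function names and the per-site depth list are hypothetical and purely illustrative, and the percentile uses a simple nearest-rank rule that may differ from what A5-miseq does internally. The only formula assumed is the standard PHRED definition, Q = -10 * log10(probability of error).

```python
import statistics


def raw_coverage(total_sequenced_bases, assembly_size):
    """Total base pairs of sequence data divided by the assembly size."""
    return total_sequenced_bases / assembly_size


def percentile_coverage(depths, fraction):
    """Depth below which roughly `fraction` of sites fall (nearest-rank rule)."""
    ordered = sorted(depths)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]


def phred_to_accuracy(q):
    """Convert a PHRED score to the probability that the base call is correct."""
    return 1 - 10 ** (-q / 10)


# Hypothetical per-site read depths for a tiny 10-site "assembly".
depths = [12, 30, 45, 50, 55, 60, 62, 70, 80, 95]

print("Raw cov:", raw_coverage(250_000_000, 5_000_000))
print("Median cov:", statistics.median(depths))
print("10th percentile cov:", percentile_coverage(depths, 0.10))
for q in (20, 30, 40):
    print(f"Q{q} accuracy:", phred_to_accuracy(q))
```

With real data, the per-site depths would come from aligning the reads back to the assembly; the list above only stands in for that step.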