Chuck Pepe-Ranney edited Sequence Quality Control and Analysis.tex  almost 10 years ago

Commit id: c43a333ebc6763a56bb65c1557531a6c45f53475


Reads were clustered into OTUs following the UParse pipeline. Specifically, USearch (version 7.0.1001) was used to establish cluster centroids at 97\% sequence identity from the quality-controlled data and to map quality-controlled reads to those centroids. The initial centroid establishment algorithm incorporates a quality control step wherein potentially chimeric reads are not allowed to become cluster seeds. Additionally, we discarded singleton reads because their quality is difficult to assess. Discarding singletons in combination with maximum expected error screening has been shown to reduce 454 sequencing error as effectively as, if not more effectively than, ``denoising'' \cite{23955772}. Moreover, two popular ``denoising'' algorithms have been shown to introduce sequencing errors while correcting others, sometimes in nearly equal proportion \cite{22543370}. At a 97\% identity cutoff, 88\% and 98\% of quality-controlled reads could be mapped back to our cluster seeds for the 16S and 23S sequences, respectively.

\subsubsection{Alpha and Beta diversity metrics}

Alpha diversity calculations were made using PyCogent Python bioinformatics modules \cite{17708774}. Beta diversity analyses were made using Phyloseq \cite{24699258} and its dependencies (cite Vegan, Capscape for MDS, etc). Log$_{2}$ fold change estimates and corresponding null-hypothesis-based significance values were calculated using DESeq2 \cite{Love_2014}. All dispersion estimates from DESeq2 were calculated using a local fit for the mean-dispersion relationship. In each analysis, sparse OTUs (those not found in at least 25\% of all samples) were discarded. Additionally, we discarded any OTUs from the 23S data that could not be annotated as belonging to the Eukaryota. All results were visualized using ggplot2 \cite{Wickham_2009}.
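
The sparsity criterion described above can be written compactly. The notation here is introduced only for illustration, and it assumes that an OTU counts as ``found'' in a sample whenever its read count in that sample is nonzero. Let $x_{ij}$ denote the number of reads assigned to OTU $i$ in sample $j$, and let $n$ be the number of samples in the analysis. The prevalence of OTU $i$ is then
\begin{equation}
p_i = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\left[ x_{ij} > 0 \right],
\end{equation}
and OTU $i$ is retained only if $p_i \geq 0.25$. Under this reading, in an analysis of $n = 20$ samples an OTU would need a nonzero count in at least 5 samples to be kept.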