The 16S sequence collection was demultiplexed, and sequences with sample barcodes not matching expected barcodes were discarded. We used the maximum expected error metric \cite{23955772}, calculated from sequence quality scores, to cull poor-quality sequences from the dataset. Specifically, we discarded any sequence with a maximum expected error count greater than 1 after truncation to 175 nt. The forward primer and barcode were trimmed from the remaining reads. We checked that all primer-trimmed, error-screened, and truncated sequences were derived from the same region of the LSU or SSU rRNA gene (23S and 16S sequences, respectively) by aligning the reads to the Silva LSU or SSU rRNA gene alignment (``Ref'' collection, release 115) with the Mothur \cite{19801464} NAST-algorithm \cite{16845035} aligner and inspecting the alignment coordinates. Reads falling outside the expected alignment coordinates were culled from the dataset. The remaining reads were trimmed to consistent alignment coordinates, such that all reads began and ended at the same alignment position, and were screened for chimeras with UCHIME in ``denovo'' mode \cite{21700674} via the Mothur UCHIME wrapper.

\subsubsection{Taxonomic annotations}

Sequences were taxonomically classified using the UCLUST-based \cite{20709691} classifier in the QIIME package \cite{20383131}, with the Greengenes database and taxonomic nomenclature (version XXXXX, 97\% OTU representative sequences and corresponding taxonomic annotations, \cite{22134646}) as the reference for 16S reads and the Silva LSU database (``Ref'' set, version 115, EMBL taxonomic annotations, \cite{23193283}) as the reference for 23S reads. We used the default parameters for the algorithm (i.e., a minimum consensus of 51\% at any rank, a minimum sequence identity of 90\% for hits, and a maximum of 3 accepted hits).

\subsubsection{Clustering}

Reads were clustered into OTUs following the UPARSE pipeline. Specifically, USEARCH (version 7.0.1001) was used to establish cluster centroids at a 97\% sequence identity level from the quality-controlled data and to map the quality-controlled reads to those centroids. The initial centroid-establishment algorithm incorporates a quality-control step wherein potentially chimeric reads are not allowed to become cluster seeds. Additionally, we discarded singleton reads, because the quality of a singleton read is difficult to assess; this filter, combined with maximum expected error screening, has proven as useful as, if not superior to, ``denoising'' for reducing 454 sequencing error \cite{23955772}. Moreover, two popular ``denoising'' algorithms have been shown to introduce sequencing errors while correcting others, sometimes in a nearly equal ratio \cite{22543370}. Eighty-eight percent and 98\% of the quality-controlled reads could be mapped back to our cluster seeds at a 97\% identity cutoff for the 16S and 23S sequences, respectively.
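For concreteness, the expected-error screen is a simple function of the Phred quality scores: a base with quality $Q$ has error probability $10^{-Q/10}$, and a read's expected error is the sum of these probabilities over its (truncated) length. The following is a minimal Python sketch of this filter, assuming Biopython for FASTQ parsing; the file names are placeholders, not the paths used in this study.

\begin{verbatim}
# Sketch of the maximum-expected-error filter (Edgar & Flyvbjerg 2013):
# truncate each read to 175 nt, then discard it if the sum of its
# per-base error probabilities, 10^(-Q/10), exceeds 1.0.
# File names below are placeholders, not the paths used in this study.
from Bio import SeqIO

TRUNC_LEN = 175
MAX_EE = 1.0

def expected_errors(quals):
    """Sum of per-base error probabilities implied by Phred scores."""
    return sum(10.0 ** (-q / 10.0) for q in quals)

with open("filtered.fastq", "w") as out:
    for rec in SeqIO.parse("demultiplexed.fastq", "fastq"):
        if len(rec) < TRUNC_LEN:
            continue  # too short to truncate to the fixed length
        rec = rec[:TRUNC_LEN]  # slicing also truncates the quality scores
        if expected_errors(rec.letter_annotations["phred_quality"]) <= MAX_EE:
            SeqIO.write(rec, out, "fastq")
\end{verbatim}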
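The consensus rule applied during taxonomic annotation can be illustrated schematically: the lineage assigned to a query is truncated at the first rank where fewer than 51\% of the accepted reference hits agree. The sketch below is an illustrative reimplementation of that rule only, not the QIIME code, and the example lineages are invented.

\begin{verbatim}
# Schematic of majority-consensus taxonomy assignment: given the
# lineages of up to 3 accepted reference hits (>= 90% identity),
# keep each rank only while at least 51% of the hits agree on it.
# Lineages here are invented, not drawn from the actual databases.
from collections import Counter

MIN_CONSENSUS = 0.51

def consensus_lineage(hit_lineages):
    """hit_lineages: list of rank lists, domain first."""
    consensus = []
    for rank_names in zip(*hit_lineages):  # walk ranks in order
        name, count = Counter(rank_names).most_common(1)[0]
        if count / len(hit_lineages) >= MIN_CONSENSUS:
            consensus.append(name)
        else:
            break  # no consensus at this rank; stop here
    return consensus

hits = [
    ["Bacteria", "Cyanobacteria", "Nostocales"],
    ["Bacteria", "Cyanobacteria", "Oscillatoriales"],
    ["Bacteria", "Proteobacteria"],
]
print(consensus_lineage(hits))  # ['Bacteria', 'Cyanobacteria']
\end{verbatim}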
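At its core, the centroid-establishment step is a greedy, abundance-ordered clustering at 97\% identity. The toy sketch below conveys that control flow only: the identity function is a naive stand-in for USEARCH's alignment heuristics, and the chimera and singleton screens described above are noted in comments rather than implemented.

\begin{verbatim}
# Toy greedy clustering in the spirit of UPARSE: reads are processed
# in order of decreasing abundance; a read joins the first centroid
# it matches at >= 97% identity, otherwise it founds a new cluster.
# Real pipelines also reject chimeric centroid candidates and drop
# singletons before clustering.

def identity(a, b):
    """Crude positional identity over the shorter read (illustrative only)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a[:n], b[:n])) / n

def greedy_cluster(reads, threshold=0.97):
    centroids, assignments = [], []
    for read in reads:  # assumed pre-sorted by decreasing abundance
        for i, c in enumerate(centroids):
            if identity(read, c) >= threshold:
                assignments.append(i)  # read maps to existing centroid
                break
        else:
            centroids.append(read)  # read founds a new cluster
            assignments.append(len(centroids) - 1)
    return centroids, assignments
\end{verbatim}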