Authorea

Amirali Sharifian edited To_simplify_searching_a_large__.tex over 8 years ago

Commit id: 91509cadcc97001339499d9e79fdb2c75c11d8f1

deletions | additions

The relatively slow speed of hash table based mappers is due to their high sensitivity and comprehensiveness. Such mappers first index fixed-length seeds (also called \emph{k-mers}), typically 10-13 base-pair-long DNA fragments from the reference genome, into a hash table or a similar data structure. Next, they divide each query read into smaller fixed length seeds to query the hash table for their associated seed locations. Finally, they try to extend the read at each of the seed locations by aligning the read to the reference fragment at the seed location via dynamic programming algorithms such as Needleman-Wunsch \cite{needleman} and SmithWaterman \cite{smith1981identification}, or \emph{simple Hamming distance} calculation for greater speed at the cost of missing potential mappings that contain insertions/deletions (indels). According to \cite{fasthash} using data provided by NGS platform shows most of the \textit{locations} fail to provide correct alignments.This is because the size of the k-mers that form the hash table’s indices are typically very short. Eeven though in mrsFAST-ultra\cite{mrsfastultra} for indexing part author is using a variation of this method but the problem is still consist. The problem is these short k-mers appear in the reference genome much more frequently than the undivided, hundreds of base-pair-long query read. As a result, only a few of the locations of a k-mer, if any, provide correct alignments. In other words, Read mappers identify locations within a reference genome where the read and the reference match within a user-defined error (i.e., insertions, deletions, or substitutions) threshold, e. In practice, e is usually $5\%$ of the read length, but most aligners can be configured to return only the best mapping (the mapping with the fewest errors). As seen in Figure \ref{fig:fig1} \ref{Fig2} \cite{shd}, most potential location mappings tested by seed-and-extend based mappers are incorrect (having more errors than allowed); in fact, when e = $5\%$ of the read length, more than $98\%$ of mappings are incorrect. Since alignment is the primary computationally intensive task performed by seed-and-extend based read mappers \cite{fasthash}, it is crucial that incorrect mappings be rejected efficiently.