Authorea

Amirali Sharifian edited To_simplify_searching_a_large__.tex over 8 years ago

Commit id: a9e941dc50930ff4cf01b5b73ea49a5a7427248e


These methods can be compared using three metrics: \emph{speed (or performance)}, \emph{sensitivity} and \emph{comprehensiveness}. Essentially, hash table based mappers are more sensitive than suffix-array based mappers, but at the cost of speed. They are also more comprehensive and more robust to sequencing errors and genomic diversity. The relatively slow speed of hash table based mappers is a consequence of this high sensitivity and comprehensiveness. Such mappers first index fixed-length seeds (also called \emph{k-mers}), typically 10--13 base-pair-long DNA fragments from the reference genome, into a hash table or a similar data structure. Next, they divide each query read into smaller fixed-length seeds and look these up in the hash table to retrieve the associated seed locations. Finally, they try to extend the read at each seed location by aligning it to the reference fragment at that location, either with a dynamic programming algorithm such as Needleman-Wunsch \cite{needleman} or Smith-Waterman \cite{smith1981identification}, or with a \emph{simple Hamming distance} calculation, which is faster but misses potential mappings that contain insertions/deletions (indels). According to \cite{fasthash}, data from NGS platforms show that most of these \textit{locations} fail to provide correct alignments. This is because the k-mers that form the hash table’s indices are typically very short, so they appear in the reference genome much more frequently than the undivided, hundreds of base-pair-long query read. Although mrsFAST-ultra \cite{mrsfastultra} uses a variation of this method for its indexing step, the problem persists. As a result, only a few of the locations of a k-mer, if any, provide correct alignments. Naively extending the read (aligning it to the reference genome) at all of the locations of all k-mers therefore introduces a large amount of unnecessary computation.
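The seed-and-extend procedure described above can be illustrated with a minimal Python sketch. It is not the implementation of any of the cited mappers: the function names (\texttt{build\_index}, \texttt{hamming}, \texttt{map\_read}), the use of non-overlapping seeds, the toy sequences and the seed length $k=4$ are all assumptions made for illustration, and verification uses only the Hamming distance, so mappings with indels would be missed.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches):
    """Return (mappings, number of candidate locations that were verified).

    Hypothetical helper: seeds are taken as non-overlapping k-mers of the read.
    """
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        # .get avoids inserting empty entries into the defaultdict
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)

    mappings = []
    for start in sorted(candidates):
        # "Extension" step, here a plain Hamming-distance check (no indels)
        distance = hamming(read, reference[start:start + len(read)])
        if distance <= max_mismatches:
            mappings.append((start, distance))
    return mappings, len(candidates)


# Toy data (assumed for illustration only)
reference = "ACGTACGTGGTTACGTCCAA"
read = "ACGTCCAA"
index = build_index(reference, 4)
mappings, verified = map_read(read, reference, index, 4, max_mismatches=1)
print(mappings, verified)
```

In this toy case the seed \texttt{ACGT} occurs at three reference positions, so three candidate locations are verified even though only one of them yields an acceptable alignment; this is the unnecessary computation that short k-mers cause.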