# Abstract

Vertical Hamming Distance (VHD) is a tools to compute hamming distance between two string at “bare metal” speed. Essantialy, it means VHD do comparison between two string at or near speed of system’s processor. Moreover, VHD has an embedded edit-distance base filter which can quickly determine whether the minimum number of edits (the tools only support substitution not insertion and deletion) is smaller than a user defined threshold e. We also improved VHD’s filtering power by considering coding scheme of our alphabet in data genomic databases.

# Introduction

To simplify searching a large database such as the human genome there are two major categories: seed-and-extend heuristic methods and suffix-array mapping methods. The seed-and-extend heuristic is developed based on the observation that for a correct mapping the short query read and its corresponding reference fragment, which is the piece of the reference genome that the query read should map to, must share some brief regions (usually 10-100 base-pair-long) of exact or inexact matches. These shorter shared regions, which indicate high similarity between the query read and the reference fragment, are called seeds. By identifying the seeds of a query read, the mapper narrows down the searching range from the whole genome to only the neighborhood region of each seed. Seeds are generated by preprocessing the reference genome and storing the locations of their occurrences in the reference genome in a separate data structure. During mapping, a seed-and-extend mapper first analyzes the query read to identify the seeds. Then, the mapper tries to extend the read at each of the seed locations via dynamic programming algorithms such as the Smith-Waterman (Smith 1981) or Neddleman-Wunsch (citation not found: needleman) algorithm.

According to (citation not found: fasthash) using data provided by NGS platform shows most of the locations fail to provide correct alignments.This is because the size of the k-mers that form the hash table’s indices are typically very short. Eeven though in mrsFAST-ultra(citation not found: mrsfastultra) for indexing part author is using a variation of this method but the problem is still persist. The problem is short k-mers appear in the reference genome much more frequently than the undivided, hundreds of base-pair-long query read. As a result, only a few of the locations of a k-mer, if any, provide correct alignments. In other words, Read mappers identify locations within a reference genome where the read and the reference match within a user-defined error (i.e., insertions, deletions, or substitutions) threshold, e. In practice, e is usually $$5\%$$ of the read length, but most aligners can be configured to return only the best mapping (the mapping with the fewest errors). As seen in Figure \ref{fig:fig1} (citation not found: shd), most potential location mappings tested by seed-and-extend based mappers are incorrect (having more errors than allowed); in fact, when e = $$5\%$$ of the read length, more than $$98\%$$ of mappings are incorrect. Since alignment is the primary computationally intensive task performed by seed-and-extend based read mappers (citation not found: fasthash), it is crucial that incorrect mappings be rejected efficiently.