Vertical Hamming Distance

Abstract

Vertical Hamming Distance (VHD) is a tools to compute hamming distance between two string at “bare metal” speed. Essantialy, it means VHD do comparison between two string at or near speed of system’s processor. Moreover, VHD has an embedded edit-distance base filter which can quickly determine whether the minimum number of edits (the tools only support substitution not insertion and deletion) is smaller than a user defined threshold e. We also improved VHD’s filtering power by considering coding scheme of our alphabet in data genomic databases.

Introduction

To simplify searching a large database such as the human genome there are two major categories: seed-and-extend heuristic methods and suffix-array mapping methods. The seed-and-extend heuristic is developed based on the observation that for a correct mapping the short query read and its corresponding reference fragment, which is the piece of the reference genome that the query read should map to, must share some brief regions (usually 10-100 base-pair-long) of exact or inexact matches. These shorter shared regions, which indicate high similarity between the query read and the reference fragment, are called seeds. By identifying the seeds of a query read, the mapper narrows down the searching range from the whole genome to only the neighborhood region of each seed. Seeds are generated by preprocessing the reference genome and storing the locations of their occurrences in the reference genome in a separate data structure. During mapping, a seed-and-extend mapper first analyzes the query read to identify the seeds. Then, the mapper tries to extend the read at each of the seed locations via dynamic programming algorithms such as the Smith-Waterman (Smith 1981) or Neddleman-Wunsch (citation not found: needleman) algorithm.

During few years ago several mappers have been developed. These mappers basically work base on two approaches: 1)hash table based, using seed and extend approach and 2) suffix-array and genome compression based mappers that utilize the Burrows-Wheeler Transform and the FerraginaManzini index (BWT-FM). mrFAST/mrsFASt (citation not found: mrfast)(citation not found: mrsfast), drFAST(citation not found: drfast) and RazerS(citation not found: razers) are examples of hash table based method and mappers such as BWA(citation not found: bwa) and Bowtie(citation not found: bowtie) are examples for BWT-FM approach.

To compare these methods with each other there are three metrics: speed (or performance), sensitivity and comprehensiveness. Essantially, hash table based mappers are more sensitive in compare with suffix-array based mappers but in cost of speed and performance. In addition, they are more comprehensive and more robust to sequence errors and genomic diversity.

The relatively slow spee