ROUGH DRAFT authorea.com/65802

# Abstract

Vertical Hamming Distance (VHD) is a tool to compute the Hamming distance between two strings at “bare metal” speed. Essentially, this means VHD compares two strings at or near the speed of the system’s processor. Moreover, VHD has an embedded edit-distance-based filter which can quickly determine whether the minimum number of edits (the tool supports only substitutions, not insertions and deletions) is smaller than a user-defined threshold e. We also improve VHD’s filtering power by considering the coding scheme of our alphabet in genomic databases.

# Introduction

Methods for searching a large database such as the human genome fall into two major categories: seed-and-extend heuristic methods and suffix-array mapping methods. The seed-and-extend heuristic is based on the observation that, for a correct mapping, the short query read and its corresponding reference fragment (the piece of the reference genome that the query read should map to) must share some brief regions (usually 10-100 base pairs long) of exact or inexact matches. These shorter shared regions, which indicate high similarity between the query read and the reference fragment, are called seeds. By identifying the seeds of a query read, the mapper narrows the search range from the whole genome down to only the neighborhood of each seed. Seeds are generated by preprocessing the reference genome and storing the locations of their occurrences in a separate data structure. During mapping, a seed-and-extend mapper first analyzes the query read to identify its seeds. Then, the mapper tries to extend the read at each of the seed locations via dynamic programming algorithms such as Smith-Waterman (Smith 1981) or Needleman-Wunsch (citation not found: needleman).
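The seed-and-extend flow described above can be sketched in a few lines of Python. This is an illustrative toy (the function names, the seed length, and the tiny reference string are our assumptions), not the interface of any real mapper:

```python
# Toy sketch of seed-and-extend: index k-mer seed locations, then use
# seed hits to propose candidate mapping locations for a query read.

def build_seed_index(reference, k):
    """Preprocess the reference: record every location of each k-mer seed."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index

def candidate_locations(read, index, k):
    """Look up each k-mer of the read; each hit is a candidate location
    that a real mapper would then verify by extension (e.g. a
    Smith-Waterman alignment against the reference fragment)."""
    hits = set()
    for j in range(len(read) - k + 1):
        for loc in index.get(read[j:j + k], []):
            hits.add(loc - j)  # shift so the candidate aligns to the read's start
    return sorted(h for h in hits if h >= 0)

ref = "ACGTACGTGGTTACGT"
idx = build_seed_index(ref, 4)
print(candidate_locations("ACGTGG", idx, 4))  # -> [0, 4, 12]
```

Only one of the three candidates (location 4) is a correct mapping here, which previews the point made below: most seed hits do not survive verification.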

These methods are compared along three metrics: speed (or performance), sensitivity, and comprehensiveness. Essentially, hash-table-based mappers are more sensitive than suffix-array-based mappers, but at the cost of speed and performance. In addition, they are more comprehensive and more robust to sequence errors and genomic diversity.

The relatively slow speed of hash-table-based mappers is due to their high sensitivity and comprehensiveness. Such mappers first index fixed-length seeds (also called k-mers), typically 10-13 base-pair-long DNA fragments from the reference genome, into a hash table or a similar data structure. Next, they divide each query read into smaller fixed-length seeds to query the hash table for their associated seed locations. Finally, they try to extend the read at each of the seed locations by aligning the read to the reference fragment at the seed location via dynamic programming algorithms such as Needleman-Wunsch (citation not found: needleman) and Smith-Waterman (Smith 1981), or a simple Hamming distance calculation for greater speed at the cost of missing potential mappings that contain insertions/deletions (indels).
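The fast verification path mentioned above (a plain Hamming distance check instead of full dynamic programming) can be sketched as follows. This is an illustrative assumption of how such a check could look, not code from any of the cited mappers:

```python
# Verify a candidate location by substitution count only (misses indels).

def hamming_verify(read, reference, loc, e):
    """Accept the mapping at `loc` iff the number of mismatches is <= e."""
    frag = reference[loc:loc + len(read)]
    if len(frag) < len(read):
        return False  # read runs off the end of the reference
    mismatches = sum(a != b for a, b in zip(read, frag))
    return mismatches <= e

ref = "ACGTACGTGGTTACGT"
print(hamming_verify("ACGAGG", ref, 4, 1))  # one substitution  -> True
print(hamming_verify("ACGAGG", ref, 0, 1))  # three mismatches  -> False
```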

According to (citation not found: fasthash), data provided by NGS platforms shows that most of the locations fail to provide correct alignments. This is because the k-mers that form the hash table’s indices are typically very short. Even though mrsFAST-ultra (citation not found: mrsfastultra) uses a variation of this method for indexing, the problem persists: short k-mers appear in the reference genome much more frequently than the undivided, hundreds-of-base-pair-long query read. As a result, only a few of the locations of a k-mer, if any, provide correct alignments. In other words, read mappers identify locations within a reference genome where the read and the reference match within a user-defined error (i.e., insertion, deletion, or substitution) threshold, e. In practice, e is usually $5\%$ of the read length, but most aligners can be configured to return only the best mapping (the mapping with the fewest errors). As seen in Figure \ref{fig:fig1} (citation not found: shd), most potential mapping locations tested by seed-and-extend based mappers are incorrect (having more errors than allowed); in fact, when e = $5\%$ of the read length, more than $98\%$ of mappings are incorrect. Since alignment is the primary computationally intensive task performed by seed-and-extend based read mappers (citation not found: fasthash), it is crucial that incorrect mappings be rejected efficiently.

Figure (a) shows the edit-distance breakdown and Figure (b) the edit-distance distribution of all potential mappings evaluated by mrFAST. Both plots show that the vast majority of the potential mappings have far more errors than the threshold e. (citation not found: shd) \label{fig:fig1}

Many mechanisms have been proposed to efficiently calculate the edit distance of strings and filter out incorrect mappings. These mechanisms can be divided into five main classes: (i) dynamic programming (DP) algorithms, (ii) SIMD implementations of DP algorithms, (iii) bit-vector implementations of DP algorithms, (iv) Hamming distance calculation, and (v) locality-based filtering mechanisms. For instance, (citation not found: swps3) uses approach (ii), (citation not found: seqan) uses approach (iii), and fastHASH (citation not found: fasthash) uses approach (v). The approach nearest to VHD, which uses bit-parallelism and SIMD instructions to achieve better performance, is Shifted Hamming Distance (SHD) (citation not found: shd). SHD by itself, however, is not an edit-distance calculator. SHD is a filter that detects and discards some of the string pairs whose edit distances are greater than T, but it does not verify whether the string pairs that pass the filter actually have edit distances smaller than T.
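To make the filtering semantics concrete (a filter may pass pairs whose distance exceeds the threshold, but it must never reject a pair within it, so every survivor still needs verification), here is a sketch of a simple pigeonhole-based substitution filter. This is a toy example of our own for illustration; it is not SHD’s mechanism:

```python
# Pigeonhole filter: if two equal-length strings are within e
# substitutions, splitting one into e + 1 pieces guarantees that at
# least one piece matches the other string exactly at the same offset.
# A False answer is a proof of > e substitutions; a True answer is only
# a candidate that must still be verified.

def pigeonhole_filter(read, frag, e):
    n = len(read)
    piece = n // (e + 1)
    for i in range(e + 1):
        lo = i * piece
        hi = n if i == e else lo + piece  # last piece absorbs the remainder
        if read[lo:hi] == frag[lo:hi]:
            return True   # survives the filter; still needs verification
    return False          # provably more than e substitutions

print(pigeonhole_filter("ACGTACGT", "ACGAACGT", 1))  # 1 mismatch  -> True
print(pigeonhole_filter("AAGTTCGA", "ACGTACGT", 1))  # 3 mismatches -> False
```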

In VHD we focus on two points:

• Speeding up the computation of the Hamming distance between two strings

• Filtering out data to reduce the number of computations

VHD combines two techniques to achieve these goals. First, it uses a bit-parallel method to store and process the data. The bit-parallel method is designed to fully utilize the entire width of the processor’s words, reducing the number of instructions needed to process the data. Second, we apply our filter to the data to reduce the number of computations while the bit-parallel algorithm is running. Moreover, to increase our filtering power, we consider how A, C, G, and T should be coded for better filtering. Finally, we introduce another technique to predict the next data to be loaded.
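The bit-parallel idea can be sketched as follows, using the 3-bit coding A=001, C=010, G=011, T=100 from the storage layout section. The transposition and popcount shown here are our own minimal illustration of the technique, not VHD’s actual kernel:

```python
# Bit-parallel Hamming distance: store each sequence as k bit-plane
# words (one w-bit word per bit of the k-bit character code). Then one
# XOR compares w characters at once, and a popcount yields the distance.

code = {"A": 0b001, "C": 0b010, "G": 0b011, "T": 0b100}

def transpose(seq):
    """Turn a sequence into k = 3 bit-plane words: bit i of plane p is
    bit p of the code of character i."""
    planes = [0, 0, 0]
    for i, ch in enumerate(seq):
        for p in range(3):
            planes[p] |= ((code[ch] >> p) & 1) << i
    return planes

def hamming_bitparallel(x_planes, y_planes):
    """Position i mismatches iff any plane's XOR has bit i set, so OR
    the per-plane XORs together and count the set bits."""
    diff = 0
    for xp, yp in zip(x_planes, y_planes):
        diff |= xp ^ yp          # one word-wide op compares w characters
    return bin(diff).count("1")  # number of mismatching positions

print(hamming_bitparallel(transpose("ACGT"), transpose("ACCT")))  # -> 1
```

On real hardware the Python integers would be fixed-width processor words and the final count a single popcount instruction.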

## Storage Layout

The VHD storage layout is inspired by the bit-sliced method (citation not found: O_Neil_1997). In VHD, each sequence is broken down into fixed-length segments, each of which contains w codes (w is the width of a processor word). If we code our alphabet with k bits, then the w k-bit codes in a segment are transposed into k w-bit words. Figure \ref{fig:fig2} shows an example of how we transpose our data. Genomic data sets have five characters: A, C, G, T, and N (unknown). We have coded each character with three bits: A (001), C (010), G (011), T (100). Our read in the exampl