Content Aware Similarity Search (CASS)

Abstract

Content Aware Similarity Search (CASS) is designed to examine the entire contents of data packets, covering both header and payload information. The network device comprises a physical interface for transforming analog network signals into bit streams and vice versa. The bit stream coming from the physical interface is forwarded to a traffic flow scanning processor that may be, but need not be, divided into a header processor and a payload analyzer. The header processor examines the header information of each data packet, which is used to determine routing information and session identification. The payload analyzer scans each data packet’s payload and matches it against a database of known strings. The payload analyzer can scan across packet boundaries and can scan for strings of variable and arbitrary length. Once the payload has been scanned, the network device can act on the data packet based on the results from the payload analyzer. The scanned data packets and the associated conclusions are passed to a quality of service processor, which enhances the data packets if necessary and performs traffic management and traffic shaping on the flow of data packets based on their contents.

Keywords

Data packets, Analog network signal, Routing information, Session identification, Payload analyzer, Database

INTRODUCTION

Over the past decades, significant progress has been made on extracting features for similarity search and object recognition from feature-rich data such as audio, images, video, and other sensor datasets. Since feature-rich data objects are typically represented as high-dimensional feature vectors, similarity search is generally implemented as K-Nearest Neighbor (KNN) or Approximate Nearest Neighbor (ANN) search in a high-dimensional feature-vector space. A similarity search method should be accurate, time efficient, and space efficient, and it should work well in high-dimensional spaces. In addition, the index data structure should be quick to construct and should handle arbitrary sequences of insertions and deletions gracefully.

A good search mechanism is a core part of an efficient content-based search system for feature-rich data. First, it should deliver search results efficiently on large datasets without consuming much CPU time or memory. For instance, it should be able to search millions of data objects in seconds. Second, it should achieve high search quality by using advanced feature extraction techniques. For example, it should be able to handle the multi-feature vector representations and the Earth Mover's Distance (EMD) similarity measure used in a region-based image retrieval (RBIR) system. Third, it should be able to search data with multiple modalities effectively. For example, when searching the continuously archived data recorded from numerous medical devices in an intensive care unit, a user should be able to express and search for patterns that span several data sources. Fourth, it should integrate with keyword-based search engines. For example, a user should be able to combine content-based similarity search with attribute-based search, such as search by time range or by annotation.

Content-Aware Similarity Search (CASS) has four current research topics.
The first is sketch construction techniques. It focuses on how a sketch, a compact summary of a feature vector, is constructed and on its dimension. The goal is to produce practical algorithms that construct sketches which substantially reduce the dimension and size of feature vectors while still supporting high-quality similarity search. The second is efficient filtering and indexing methods. This topic is challenging because filtering and indexing large feature-rich datasets requires similarity matching, so indexing data structures designed for exact match do not apply. The goal is to find novel data structures and algorithms that filter and index large datasets for similarity search. The third is similarity search over multiple data types. The goal is to gain a deeper understanding of similarity search for various data types, including audio, images, documents, and others. The last is a toolkit for similarity search. This toolkit is intended to differ from most existing search toolkits. The goal is to develop a toolkit that can be used to construct search engines for various data types by plugging in type-specific data segmentation, feature extraction, and distance calculation modules.
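
To make the first two topics more concrete, the following is a minimal Python sketch of one generic way to build bit sketches and use them as a filter before exact ranking. It is an illustration under stated assumptions, not the method used by CASS: the random-hyperplane sign sketch, the function names (make_hyperplanes, sketch, hamming, euclidean, search), and the parameters (64 bits, 20 candidates, 16-dimensional random vectors) are all chosen here for the example. Note that sign sketches of this kind approximate angular similarity, so the Hamming filter is only a rough proxy for the Euclidean ranking applied afterwards.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (Gaussian normals) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def sketch(vec, planes):
    """Reduce a feature vector to a bit sketch: one sign bit per hyperplane."""
    bits = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two sketches."""
    return bin(a ^ b).count("1")


def euclidean(u, v):
    """Exact distance, used only on the filtered candidates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def search(query, vectors, sketches, planes, k=3, n_candidates=20):
    """Filter by sketch distance, then rank the survivors by exact distance."""
    q = sketch(query, planes)
    # Filtering step: cheap Hamming comparisons over every stored sketch.
    candidates = sorted(
        range(len(vectors)), key=lambda i: hamming(q, sketches[i])
    )[:n_candidates]
    # Ranking step: exact distance on the small candidate set only.
    return sorted(candidates, key=lambda i: euclidean(query, vectors[i]))[:k]


# Hypothetical data: 200 random 16-dimensional feature vectors.
rng = random.Random(1)
vectors = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(200)]
planes = make_hyperplanes(16, 64)
sketches = [sketch(v, planes) for v in vectors]

# The query is a copy of vector 7, so index 7 is the first hit.
print(search(vectors[7], vectors, sketches, planes))
```

In this example each 16-dimensional vector is summarized by a 64-bit integer, and the exact distance is computed for only 20 of the 200 objects. A real system would tune the sketch length and candidate count against the accuracy it needs, and would replace the linear scan over sketches with an index.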