    Content Aware Similarity Search (CASS)

    Abstract

    A Content Aware Similarity Search (CASS) device is designed to examine the contents of entire data packets, including both header and payload information. The network device comprises a physical interface that transforms the analog network signal into bit streams and vice versa. The bit stream from the physical interface is forwarded to a traffic-flow scanning processor, which may be, though need not be, divided into a header processor and a payload analyzer. The header processor examines the header information of each data packet, which is used to determine routing information and session identification. The payload analyzer scans each data packet’s payload and matches it against a database of known strings; it is able to scan across packet boundaries and to scan for strings of variable and arbitrary length. Once the payload has been scanned, the network device can act on the data packet based on the results of the payload analyzer. The scanned data packets and the associated conclusions then pass through a quality-of-service processor, which modifies the data packets if necessary and performs traffic management and traffic shaping on the flow of data packets based on their contents.

    Keywords

    Data packets, Analog network signal, Routing information, Session identification, Payload analyzer, Database

    INTRODUCTION

    Over the previous decades, significant progress has been made on extracting features for similarity search and object recognition from feature-rich data such as audio, images, video, and other sensor datasets. Since feature-rich data objects are normally represented as high-dimensional feature vectors, similarity search is generally implemented as K-Nearest Neighbor (KNN) or Approximate Nearest Neighbor (ANN) search in a high-dimensional feature-vector space. Similarity search should be accurate, time efficient, space efficient, and able to handle high-dimensional data. In addition, construction of the index data structure should be quick, and the index should handle arbitrary sequences of insertions and deletions gracefully.

    An efficient content-based search system for feature-rich data needs a good search mechanism. First, it should deliver search results efficiently on large datasets without consuming excessive CPU and memory resources; for instance, it should be able to search millions of data objects in seconds. Second, it should achieve high search quality by using advanced feature-extraction techniques; for example, it should be able to handle multi-feature vector representations and the Earth Mover’s Distance (EMD) similarity measure used in a region-based image retrieval (RBIR) system. Third, it should be able to search data with multiple modalities effectively; for example, when searching the continuously archived data recorded from numerous medical devices in an intensive care unit, a user should be able to express and search patterns across several data sources. Fourth, it should integrate with a keyword-based search engine, so that a client can perform content-based similarity search together with attribute-based search such as time-range or annotation-based search.

    Content-Aware Similarity Search (CASS) has four current research topics. The first is sketch construction techniques, where a sketch is a compact, lower-dimensional summary of a feature vector; the goal is to produce practical algorithms for constructing sketches that substantially reduce the dimension and size of feature vectors while still achieving high-quality similarity search. The second is efficient filtering and indexing methods. This topic is difficult because filtering and indexing large feature-rich datasets requires similarity matching, and indexing data structures designed for exact match do not apply; the goal is to devise novel data structures and algorithms for filtering and indexing large datasets for similarity search. The third is similarity search of multiple data types; the goal is a deeper understanding of similarity search across data types including audio, images, documents, and more. The last is a toolkit for similarity search, which differs from most existing search toolkits; the goal is to develop a toolkit that can be used to construct search engines for various data types by plugging in data-type-specific segmentation, feature-extraction, and distance-computation modules.
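    To make the sketch-construction topic concrete, the following is a minimal Python sketch of one common approach, random-hyperplane (SimHash-style) projection, which compresses a high-dimensional feature vector into a short bit string whose Hamming distance approximates the angular distance between the original vectors. The function names and parameters here are illustrative assumptions, not the CASS implementation.

        import numpy as np

        def make_sketch_planes(dim, n_bits, seed=0):
            # Random hyperplanes used to sketch feature vectors (assumed
            # SimHash-style construction; CASS's actual sketches differ).
            rng = np.random.default_rng(seed)
            return rng.standard_normal((n_bits, dim))

        def sketch(vec, planes):
            # Compress a high-dimensional vector into n_bits sign bits.
            return (planes @ vec >= 0).astype(np.uint8)

        def sketch_distance(a, b):
            # Hamming distance between sketches approximates the angular
            # distance between the original feature vectors.
            return int(np.count_nonzero(a != b))

        # Usage: a 256-dimensional vector shrinks to a 64-bit sketch.
        planes = make_sketch_planes(dim=256, n_bits=64)
        v1 = np.random.default_rng(1).standard_normal(256)
        v2 = v1 + 0.1 * np.random.default_rng(2).standard_normal(256)
        print(sketch_distance(sketch(v1, planes), sketch(v2, planes)))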

    METHODS

    1. THE FERRET TOOLKIT

    The Ferret toolkit is designed to allow systems builders to construct efficient content-based similarity search systems for various feature-rich data types. There are three kinds of components in the Ferret toolkit:

    • Core components - The core similarity search engine, an attribute-based search tool, metadata management, and a command-line query interface are the key elements of the toolkit that are data type independent.

    • Plug-in components - There are two plug-in components: segmentation and feature extraction, and distance functions. These enable systems builders to construct similarity search systems for specific data types.

    • Customizable components - Data acquisition, web interface, and performance evaluation tool provide a set of functions that are commonly needed in a similarity search system.

    Users of the toolkit can construct search systems by combining components from the toolkit, supplying a small number of data-type-specific routines that implement the plug-in interface, and customizing the user-interface, performance-evaluation, and data-acquisition components.
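    As a rough illustration of this separation, the sketch below models the core engine as data-type-independent code that touches the plug-ins only through a narrow interface. All class and method names are hypothetical; the real Ferret toolkit’s APIs may look quite different.

        from abc import ABC, abstractmethod

        class Segmenter(ABC):
            # Plug-in: split a raw data object into segments (e.g. image regions).
            @abstractmethod
            def segment(self, obj): ...

        class FeatureExtractor(ABC):
            # Plug-in: map each segment to a feature vector.
            @abstractmethod
            def extract(self, segment): ...

        class DistanceFunction(ABC):
            # Plug-in: distance between two multi-feature object representations.
            @abstractmethod
            def distance(self, a, b): ...

        class SearchEngine:
            # Core component: data-type independent; it sees only the plug-ins.
            def __init__(self, segmenter, extractor, dist):
                self.segmenter, self.extractor, self.dist = segmenter, extractor, dist
                self.index = []  # (object id, list of segment feature vectors)

            def insert(self, oid, obj):
                feats = [self.extractor.extract(s) for s in self.segmenter.segment(obj)]
                self.index.append((oid, feats))

            def query(self, obj, k):
                feats = [self.extractor.extract(s) for s in self.segmenter.segment(obj)]
                ranked = sorted(self.index, key=lambda e: self.dist.distance(feats, e[1]))
                return [oid for oid, _ in ranked[:k]]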

    2. FILTERING

    The brute-force approach is the most general but least efficient method: it goes through all data objects, computes each one’s distance from the query object, and returns the most similar ones. With large datasets this approach is extremely inefficient and time consuming.
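    For reference, a minimal Python version of this brute-force baseline might look as follows; the helper name and the Euclidean default distance are assumptions for illustration.

        import heapq
        import numpy as np

        def brute_force_knn(query, dataset, k,
                            dist=lambda a, b: float(np.linalg.norm(a - b))):
            # Scan every object, compute its distance to the query, and keep
            # the k closest. Correct but O(n) distance computations per
            # query, which is what makes it impractical on large datasets.
            return heapq.nsmallest(k, ((dist(query, v), i)
                                       for i, v in enumerate(dataset)))

        # Usage: 3 nearest neighbors among 10,000 random 128-d vectors.
        rng = np.random.default_rng(0)
        data = [rng.standard_normal(128) for _ in range(10_000)]
        print(brute_force_knn(rng.standard_normal(128), data, k=3))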

    Ferret makes this tractable by splitting the search process into two steps:

    • Filtering step - it quickly filters out “bad” answers and generates a small candidate set

    • Similarity ranking step - using a multi-feature object distance function, it computes the distance for each candidate object and returns the k nearest objects (a one-dimensional EMD illustration follows below)
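    General EMD requires a minimum-cost-flow solver, which is what makes the ranking step comparatively expensive. For one-dimensional histograms of equal total mass, however, EMD has a simple closed form, sketched below as an illustration (the function name is hypothetical):

        import numpy as np

        def emd_1d(p, q):
            # EMD between two 1-D histograms with equal total mass reduces
            # to the L1 distance between their cumulative distributions.
            p, q = np.asarray(p, float), np.asarray(q, float)
            return float(np.abs(np.cumsum(p - q)).sum())

        print(emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # one bin shift -> 1.0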

    Criterion for picking candidate objects: an object is a candidate if it has at least one segment that is close enough to one of the top segments of the query object.

    By separating the search process into these two steps, one can use a forgiving approximation method in the filtering step to improve search speed, while applying a sophisticated and perhaps inefficient ranking step to ensure search quality. If the candidate set is small, only a small number of EMD computations are needed in the second step. In addition to speed, the filtering step also provides a natural way to integrate the content-based similarity search engine with an attribute-based search engine.
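    A minimal sketch of this two-step scheme, assuming a threshold-based segment filter (the names, the threshold parameter, and the list-based index are illustrative, not Ferret’s actual data structures):

        def two_step_search(query_segs, index, k, seg_dist, rank_dist, threshold):
            # Filtering step: keep objects with at least one segment close
            # to a query segment; cheap and forgiving, may admit false
            # positives that the ranking step will weed out.
            candidates = [
                (oid, segs) for oid, segs in index
                if any(seg_dist(q, s) <= threshold
                       for q in query_segs for s in segs)
            ]
            # Ranking step: run the expensive multi-feature distance
            # (e.g. EMD-based) only on the small candidate set.
            candidates.sort(key=lambda e: rank_dist(query_segs, e[1]))
            return [oid for oid, _ in candidates[:k]]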

    Filtering thus has two goals: the candidate set should contain most of the truly similar data objects, and it should be small and generated quickly.