Procedure

Information Extraction

Instead of adopting a full-ML approach, we tried to capture linguistic rules as regular expressions. The idea is to specify expressions that each capture a certain type of information. For example, the expression (watched|seen) (NP), where (NP) denotes a noun phrase, might capture the names of movies (the noun phrase) in a set of documents. With a set of rules like this, it is possible to extract a significant amount of information. Such rule sets are often implemented using finite-state transducers, which consist of a series of finite-state automata. We used Citrus to declaratively define the parsing expression grammar (PEG) in Ruby. For example
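A minimal sketch of the (watched|seen) (NP) rule in plain Ruby, without Citrus, may help make the idea concrete. It assumes a very crude stand-in for a noun phrase (an optional lowercase determiner followed by a run of capitalized words); the names NOUN_PHRASE, MOVIE_RULE and extract_movies are hypothetical and chosen only for illustration, and a real system would recognize noun phrases with a proper chunker or grammar.

```ruby
# Hypothetical approximation of a noun phrase: an optional lowercase
# determiner, then one or more capitalized words (the captured part).
# This is an assumption for illustration, not a real NP recognizer.
NOUN_PHRASE = /(?:the |a |an )?([A-Z][\w']*(?: [A-Z][\w']*)*)/

# The rule from the text: a trigger verb followed by a noun phrase.
MOVIE_RULE = /\b(?:watched|seen) #{NOUN_PHRASE}/

# Apply the rule to every document and collect the captured noun phrases.
def extract_movies(documents)
  documents.flat_map { |doc| doc.scan(MOVIE_RULE).flatten }
end

docs = [
  "Last night we watched The Matrix and went home.",
  "Have you seen Blade Runner yet?",
  "I watched it twice."
]

p extract_movies(docs)
# => ["The Matrix", "Blade Runner"]
```

The third sentence yields nothing because "it" does not fit the noun-phrase approximation. That is the trade-off of hand-written rules: they are precise where the pattern holds but silently miss anything phrased differently.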