Related Work

Digital Signal Processing on Graphs

Signal processing on graphs as defined in \cite{Shuman} is an emergent field that extends high-dimensional data analysis to networks and other irregular domains. The graph spectral frequency domain is processed by fundamental operations such as filtering, translation, modulation, dilation and downsampling. A graph signal is defined by a finite collection of samples at each graph vertex. To compute information at each graph node a small neighbourhood of the vertex is considered. The graph Laplacian and its eigenvectors and eigenvalues encodes the graph connectivity. The graph Laplacian eigenvector corresponding to lower eigenvalues are smoother. Other graphs matrices are the normalized graph Laplacian and the random walk matrix. More research has to be done to know when to use each matrix. The authors define generalized operators for signals in graphs. Extensions include: analyzing directed graphs and vertexes with a time serie associated.

In another article: \cite{Shuman_vertex-frequencyanalysis} “Vertex-Frequency Analysis on Graphs” defines the windowed Fourier transform that probably could be used for Twitter

The authors of \cite{Miller_2013} propose the use of detection and estimation theory as defined for vector spaces with Gaussian noise in the context of graph analytics framework creating a new research area at the intersection of this domains. It is applied in the situational awareness cyber security to detect suspicious activity. Small subsets of vertices whose interactions do not fit the typical behaviour are identified. Relationships modeled as a graph are dificult to be analized in the Detection Theory framework. Translation and scaling are difficult to define for combinatorial and discrete graphs.

\cite{Miller_2012} describes the D4M architecture for handling large graphs. Analizes the Web of Science database and finds clutter and identifies emerging clusters in the eigenspace of graph residuals. “Future work will include developing automated methods to filter away clutter in the graph data, incorporating metadata into our models of graph residuals, and analyzing multi-graphs in which different types of edges correspond to different relationships.”

\cite{Miller_2010} identifies a couple of graph problems that can be solved using signal processing: subgraph finding in a larger graph can be solved with match filtering for signal detection; very dense subgraph detection; frequency occuring subgraph or a certain behavioral pattern; The subgraph matching is further analysed in the framework of detection theory.

\cite{Arru:2013:SUR:2487788.2488088} describes a signal based model with a time dimension for determining user interests. Wavelets are used to determine similarity among users, implementing user recommendation. A new user model bag of signal operating on pseudo documents is proposed.

\cite{Dong_2014} studies Multi-Scale Event Detection in social media. It used the Wavelet Transform on a single graph clustering algorithm.

\cite{Coifman_2006} is a math theoretical article in wich it is shown that in the context of manifolds, graphs, data sets and general metric spaces, diffusion processes and Markov processes can be analyzed in a multiscale fashion very much in the spirit of classical wavelet analysis. It is proposed fast and stable algorithms for constructing orthonormal bases for the multiscale approximation spaces, and it is shown that there is an efficient scaling function and wavelet transform.

\cite{Coifman_2011} studies harmonic analysis of databases.

\cite{Sandryhaila_2013} constructs a new signal processing type of graph filter for the classification problem.

\cite{Symeonidis_2013} studies link prediction in biological and social network.

\cite{Robinson_2014} math theoretical

Social Networks

\cite{Campbell_2014} studies the following problems: community detection, relational learning, leadership role prediction. Co-occurences of entities is used to form a graph from text. A more refined approach is to extract relationships from text. A schema with entity types and attributes is described. A graphical query language is used to analyze the entity graph.

\cite{Sizov_2010} studies authority ranking on social networks. The use of hashtags are indicative of emergent semantics in modern social networks as in \cite{Dellschaft_2009}. The authors formaly define social graphs and the application of tensors for authority ranking. A SocialWeb graph is defined as a graph G = ( V, L, E, linkType ) where V is the set of users in the community, L is the set of literals (e.g., hashtags), and E is the set of relations between users in V . Additionally, the function linkType : E -> L returns the annotation from L that relates two users. User X links to user Y by edge of type Z iff a) X follows Y (in the common sense of Twitter) and b) both X and Y have recently used the hashtag Z in their own postings/tweets. The authors describe the data collection and transformation processes for Twitter.

\cite{Choudhury_2010} defines diffusion and prediction on social media.

\cite{Altshuler_2012} analyzes trend prediction.

In \cite{Guille_2013} it is studied the information diffusion and prediction. The author surveys the research done, defines a new graph model: Time Based Asynchronous Independent Cascades (T-BaSIC) and provides an open source implementation SOcial Network DYnamics (SONDY). Definitions are provided for message topics, social influence, heard behaviour and informational cascade. A probability of information transmission between nodes is used in the model. Due to “closed world assumption” the model underestimates the volume.

\cite{Jain_2014} studies the harmonic analysis of the co-occurence matrix for short text messages. It uses diffusion wavelets for tweet classification.

\cite{Li_2012} adresses the Named Entity Recognition problem. It uses Wikipedia and Web N-Gram, tweet segmentation.

\cite{Nurwidyantoro_2013} surveys Event Detection: disaster, traffic, outbreak, and news.

\cite{Pohl_2013} uses clustering approaches for sub-event detection related to the topic detection and tracking.

\cite{Gao_2013} describes a novel method for geographical social event detection in social media.

\cite{Ramanathan_2014} “Social Role Recognition for Human Event Understanding”

\cite{Zhou_2013} event detection

\cite{Weng_2010} event detection

Spectral Information Retrieval

\cite{Costa_2010} describes how the Discret Fourier Transform (DFT) can be used for Information Retrieval (IR). A query is represented as a sum of individual query terms represented as sinusoidal curves. The query function is transformed in the frequency domain through DFT resulting a spectrum. Each document is considered to be a set of filters that filter the query spectrum resulting a power that ranks the documents. The more the power of the spectrum is decreased the higher the document ranks. The model is named Least Spectral Power Ranking (LSPR). The model can be used for recommender systems.

\cite{Ramamohanarao_2004} is an earlier spectral based method for IR. Apart from term frequencies it uses the term position in the documents represented as the phase of the signal. “It calculates document scores based on the query term spectra derived from query term signals.” To handle signal shifts the Fourier transform is used.

\cite{Caputo_2009} NER for improving IR.

\cite{Li_2011} spectral methods to cluster XML documents.

\cite{He_2011} “Using Multi-Modal Semantic Association Rules to fuse keywords and visual features automatically for Web image retrieval”

NLP

\cite{Mihalcea_2011} Book on graph mehods in NLP

\cite{Basile} Dependency parsing

\cite{Molino} QA: “Exploiting Distributional Semantic Models in Question Answering”

Big Data and Cloud

\cite{Tran_2012} describes the Google Map Reduce paradigm with it’s drawbacks. Another programming model is the Data Flow Graph (DFG). The implementation is open source and can be found at https://github.com/nltran/arom. It is inspired by Microsoft Dryad system \cite{Isard_2007} that was later deprecated in favor of Hadoop.

Spark \cite{Zaharia_2010} is written in Java and exposes the DryadLINQ API.

Shark \cite{Xin_2013}

Cascading

Apache Crunch

Apache S4

Storm

Big Data In Action

Real time big data emerging architecture - Mike Barlow

Big Data Now 2012

\cite{Chen_2014} Survey

\cite{Kepner_2012} D4M