Big Data Digital Signal Processing on Social Networks Graphs
Twitter related problems:
Relation Extraction (Wang 2011)
Classification (Jain 2014)
Event Tracking (Pohl 2013)
Geo search and visualization (Gao 2013)
Trend Mining (Desmier 2013)
Rezidual Analysis of Attributed Graphs (Miller 2013)
Detection Theory (Miller 2013)
Spectral analysis of graphs
Tensors (Miller 2013)
Problems addresed in the framework of DSP on graphs:
Mathematical model of Twitter as a dynamic attributed graph with streams attached to vertexes.
Twitter diffusion of topics or hash terms.
Quering a corpus of dependencies parses of sentences viewed as graphs. (Miller 2012)
Integrated search of documents, multimedia archives and geographic data.
Detection Theory on twitter graphs.
Event Detection. (Dong 2014)
Early Warning System
Big Data Implementation.
Big Data Application Layer for Graphs:
Remove high frequency words as in Lucene
Adapteva or GPU/matlab. http://www.mathworks.com/discovery/gpu-signal-processing.html Use data from SEMEVAL 2014 for sentence semantic relatedness. Dependency parsing based links as Walsh codes, capture relation between words expressed by a vector (word2vec). Unified search from RDF Graphs and unstructured text. Use iconic environment (m3data or Simulink). Study deeplearning4J twitter application. To draw dependency graphs: DependenSee A Dependency Parse Visualisation Tool that makes pictures of Stanford Dependency output. By Awais Athar. (http://nlp.stanford.edu/software/lex-parser.shtml#Sample). Form a document signal by concatenate sentences associated signals. A dependancy graph link source is encoded by a Walsh code and the destination by the code obtained by a rotation with -90 degree. Question-Answering as a decoding-encoding problem or filter(docs)/Fourier(search). Apply at multimedia annotation or unified searching, encryption and watermarking.
Jive search with relations represented as database tabels in M3Data (Campbell 2013), community detection, leadership... implemented as M3Data big data (AROM) by Apache Crunch on top of Spark with collaborative interface by NoFlo
There are four types of twitter streams that a ordinary user has acces to: trends, search phrase wich returns up to maximum 1500 tweets, user timeline, streams parametrized by keywords or users and spritzer stream that is 10% of overall tweets. Theese can be implemented as tab panels in a spread sheet like user interface. One dimension of the spreadsheet is given by the trending entity and another is based on the stream based on the keywords category. There is a single twitter stream given by all keywords in all categories that is further classified (use a signal processing approach) in individual categories. Another tab can be the co-occurence matrix. Each cell contains the mostly tweeted entity. On cell click the first 20 for example are displayed in a popup along with the corresponding tweets maybe on the right side of the screen. Fuel UX datagrid and Twitter’s Bootstrap are used for the user interface.
Big data processing can be integrated in M3Data. The underlying database is Apache Accumulo and the processing could be done in a pipeline approach by Cascading. Cascading can run on top of Accumulo (or Storm). Another distributed stream based systems are Apache S4 and Storm.
M3Data blocks can be made for Twitter streams as defined above. Other processing blocks ca be built for the following named entity categories: user prefixed by the @ sign, hash (#) sign identified topics, web pages prefixed by http, youtube videos, stock market companies prefixed by $, retweets RT, OpenNLP: people, places, dates, Freebase entities, news, schema.org. A regex block could make implementation easier.
Co-occurence matrix of entities can be identified by Cascading on Accumulo and presented in the spreadsheet interface in all the four stream tabs identified previously. Another M3Data block for tf-idf for concepts as defined in (Arru 2013) that uses wavelets to implement a reccomender system.
For a corpus of existing tweets, Twitter2011 or 2012 TREC corpus or Edinburgh corpus can be used. The SNOW 2014 Data Challenge “is to automatically mine social streams to provide journalists with a set of headlines and complementary information that summarize the newsworthy topics for a number of timeslots (time intervals) of interest.”
Other ideas: (Dong 2014) JIP like interface; As in (Costa 2010) represent streaming phrases as a sum of individual keyword terms as sinusoidal curves and study a feedback loop on keywords for streaming api: for a set of trending keywords identify a new set that is used in the next iteration to feed the system; semantic fields; complex event processing (cep); identify text patterns between entities; unified search for entities; cascading in M3Data (Lingual) + ML (PMML); OLAP cube like operations; UIMA; biginsights; deep, dark web; geo data; combine M3Data security with Accumulo’s cell based security; detection theory applied on streams; Cascading, pattern, D4M, on Accumulo, NoFlo.io; PMML on signal processing and spectral graphs; apply TextRazor type rules to twitter, prolog or CHR; Networks and SDN.
Signal processing on graphs as defined in (Shuman 2013) is an emergent field that extends high-dimensional data analysis to networks and other irregular domains. The graph spectral frequency domain is processed by fundamental operations such as filtering, translation, modulation, dilation and downsampling. A graph signal is defined by a finite collection of samples at each graph vertex. To compute information at each graph node a small neighbourhood of the vertex is considered. The graph Laplacian and its eigenvectors and eigenvalues encodes the graph connectivity. The graph Laplacian eigenvector corresponding to lower eigenvalues are smoother. Other graphs matrices are the normalized graph Laplacian and the random walk matrix. More research has to be done to know when to use each matrix. The authors define generalized operators for signals in graphs. Extensions include: analyzing directed graphs and vertexes with a time serie associated.
In another article: (Shuman 2013a) “Vertex-Frequency Analysis on Graphs” defines the windowed Fourier transform that probably could be used for Twitter
The authors of (Miller 2013) propose the use of detection and estimation theory as defined for vector spaces with Gaussian noise in the context of graph analytics framework creating a new research area at the intersection of this domains. It is applied in the situational awareness cyber security to detect suspicious activity. Small subsets of vertices whose interactions do not fit the typical behaviour are identified. Relationships modeled as a graph are dificult to be analized in the Detection Theory framework. Translation and scaling are difficult to define for combinatorial and discrete graphs.
(Miller 2012) describes the D4M architecture for handling large graphs. Analizes the Web of Science database and finds clutter and identifies emerging clusters in the eigenspace of graph residuals. “Future work will include developing automated methods to filter away clutter in the graph data, incorporating metadata into our models of graph residuals, and analyzing multi-graphs in which different types of edges correspond to different relationships.”
(Miller 2010) identifies a couple of graph problems that can be solved using signal processing: subgraph finding in a larger graph can be solved with match filtering for signal detection; very dense subgraph detection; frequency occuring subgraph or a certain behavioral pattern; The subgraph matching is further analysed in the framework of detection theory.
(Arru 2013) describes a signal based model with a time dimension for determining user interests. Wavelets are used to determine similarity among users, implementing user recommendation. A new user model bag of signal operating on pseudo documents is proposed.
(Dong 2014) studies Multi-Scale Event Detection in social media. It used the Wavelet Transform on a single graph clustering algorithm.