Big Data Digital Signal Processing on Social Networks Graphs
Twitter related problems:
Relation Extraction (Wang 2011)
Classification (Jain 2014)
Event Tracking (Pohl 2013)
Geo search and visualization (Gao 2013)
Trend Mining (Desmier 2013)
Rezidual Analysis of Attributed Graphs (Miller 2013)
Detection Theory (Miller 2013)
Spectral analysis of graphs
Tensors (Miller 2013)
Problems addresed in the framework of DSP on graphs:
Mathematical model of Twitter as a dynamic attributed graph with streams attached to vertexes.
Twitter diffusion of topics or hash terms.
Quering a corpus of dependencies parses of sentences viewed as graphs. (Miller 2012)
Integrated search of documents, multimedia archives and geographic data.
Detection Theory on twitter graphs.
Event Detection. (Dong 2014)
Early Warning System
Big Data Implementation.
Big Data Application Layer for Graphs:
Remove high frequency words as in Lucene
Adapteva or GPU/matlab. http://www.mathworks.com/discovery/gpu-signal-processing.html Use data from SEMEVAL 2014 for sentence semantic relatedness. Dependency parsing based links as Walsh codes, capture relation between words expressed by a vector (word2vec). Unified search from RDF Graphs and unstructured text. Use iconic environment (m3data or Simulink). Study deeplearning4J twitter application. To draw dependency graphs: DependenSee A Dependency Parse Visualisation Tool that makes pictures of Stanford Dependency output. By Awais Athar. (http://nlp.stanford.edu/software/lex-parser.shtml#Sample). Form a document signal by concatenate sentences associated signals. A dependancy graph link source is encoded by a Walsh code and the destination by the code obtained by a rotation with -90 degree. Question-Answering as a decoding-encoding problem or filter(docs)/Fourier(search). Apply at multimedia annotation or unified searching, encryption and watermarking.
Jive search with relations represented as database tabels in M3Data (Campbell 2013), community detection, leadership... implemented as M3Data big data (AROM) by Apache Crunch on top of Spark with collaborative interface by NoFlo
There are four types of twitter streams that a ordinary user has acces to: trends, search phrase wich returns up to maximum 1500 tweets, user timeline, streams parametrized by keywords or users and spritzer stream that is 10% of overall tweets. Theese can be implemented as tab panels in a spread sheet like user interface. One dimension of the spreadsheet is given by the trending entity and another is based on the stream based on the keywords category. There is a single twitter stream given by all keywords in all categories that is further classified (use a signal processing approach) in individual categories. Another tab can be the co-occurence matrix. Each cell contains the mostly tweeted entity. On cell click the first 20 for example are displayed in a popup along with the corresponding tweets maybe on the right side of the screen. Fuel UX datagrid and Twitter’s Bootstrap are used for the user interface.
Big data processing can be integrated in M3Data. The underlying database is Apache Accumulo and the processing could be done in a pipeline approach by Cascading. Cascading can run on top of Accumulo (or Storm). Another distributed stream based systems are Apache S4 and Storm.
M3Data blocks can be made for Twitter streams as defined above. Other processing blocks ca be built for the following named entity categories: user prefixed by the @ sign, hash (#) sign identified topics, web pages prefixed by http, youtube videos, stock market companies prefixed by $, retweets RT, OpenNLP: people, places, dates, Freebase entities, news, schema.org. A regex block could make implementation easier.
Co-occurence matrix of entities can be identified by Cascading on Accumulo and presented in the spreadsheet interface in all the four stream tabs identified previously. Another M3Data block for tf-idf for concepts as defined in (Arru 2013) that uses wavelets to implement a reccomender system.
For a corpus of existing tweets, Twitter2011 or 2012 TREC corpus or Edinburgh corpus can be used. The SNOW 2014 Data Challenge “is to automatically mine social streams to provide journalists with a set of headlines and complementary information that summarize the newsworthy topics for a number of timeslots (time intervals) of interest.”
Other ideas: (Dong 2014) JIP like interface; As in (Costa 2010) represent streaming phrases as a sum of individual keyword terms as sinusoidal curves and study a feedback loop on keywords for streaming api: for a set of trending keywords identify a new set that is used in the next iteration to feed the system; semantic fields; complex event processing (cep); identify text patterns between entities; unified search for entities; cascading in M3Data (Lingual) + ML (PMML); OLAP cube like operations; UIMA; biginsights; deep, dark web; geo data; combine M3Data security with Accumulo’s cell based security; detection theory applied on streams; Cascading, pattern, D4M, on Accumulo, NoFlo.io; PMML on signal processing and spectral graphs; apply TextRazor type rules to twitter, prolog or CHR; Networks and SDN.
Signal processing on graphs as defined in (Shuman 2013) is an emergent field that extends high-dimensional data analysis to networks and other irregular domains. The graph spectral frequency domain is processed by fundamental operations such as filtering, translation, modulation, dilation and downsampling. A graph signal is defined by a finite collection of samples at each graph vertex. To compute information at each graph node a small neighbourhood of the vertex is considered. The graph Laplacian and its eigenvectors and eigenvalues encodes the graph connectivity. The graph Laplacian eigenvector corresponding to lower eigenvalues are smoother. Other graphs matrices are the normalized graph Laplacian and the random walk matrix. More research has to be done to know when to use each matrix. The authors define generalized operators for signals in graphs. Extensions include: analyzing directed graphs and vertexes with a time serie associated.
In another article: (Shuman 2013a) “Vertex-Frequency Analysis on Graphs” defines the windowed Fourier transform that probably could be used for Twitter
The authors of (Miller 2013) propose the use of detection and estimation theory as defined for vector spaces with Gaussian noise in the context of graph analytics framework creating a new research area at the intersection of this domains. It is applied in the situational awareness cyber security to detect suspicious activity. Small subsets of vertices whose interactions do not fit the typical behaviour are identified. Relationships modeled as a graph are dificult to be analized in the Detection Theory framework. Translation and scaling are difficult to define for combinatorial and discrete graphs.
(Miller 2012) describes the D4M architecture for handling large graphs. Analizes the Web of Science database and finds clutter and identifies emerging clusters in the eigenspace of graph residuals. “Future work will include developing automated methods to filter away clutter in the graph data, incorporating metadata into our models of graph residuals, and analyzing multi-graphs in which different types of edges correspond to different relationships.”
(Miller 2010) identifies a couple of graph problems that can be solved using signal processing: subgraph finding in a larger graph can be solved with match filtering for signal detection; very dense subgraph detection; frequency occuring subgraph or a certain behavioral pattern; The subgraph matching is further analysed in the framework of detection theory.
(Arru 2013) describes a signal based model with a time dimension for determining user interests. Wavelets are used to determine similarity among users, implementing user recommendation. A new user model bag of signal operating on pseudo documents is proposed.
(Dong 2014) studies Multi-Scale Event Detection in social media. It used the Wavelet Transform on a single graph clustering algorithm.
(Coifman 2006) is a math theoretical article in wich it is shown that in the context of manifolds, graphs, data sets and general metric spaces, diﬀusion processes and Markov processes can be analyzed in a multiscale fashion very much in the spirit of classical wavelet analysis. It is proposed fast and stable algorithms for constructing orthonormal bases for the multiscale approximation spaces, and it is shown that there is an eﬃcient scaling function and wavelet transform.
(Coifman 2011) studies harmonic analysis of databases.
(Sandryhaila 2013) constructs a new signal processing type of graph filter for the classification problem.
(Symeonidis 2013) studies link prediction in biological and social network.
(Robinson 2014) math theoretical
(Campbell 2013) studies the following problems: community detection, relational learning, leadership role prediction. Co-occurences of entities is used to form a graph from text. A more refined approach is to extract relationships from text. A schema with entity types and attributes is described. A graphical query language is used to analyze the entity graph.
(Sizov 2010) studies authority ranking on social networks. The use of hashtags are indicative of emergent semantics in modern social networks as in (Dellschaft 2009). The authors formaly define social graphs and the application of tensors for authority ranking. A SocialWeb graph is defined as a graph G = ( V, L, E, linkType ) where V is the set of users in the community, L is the set of literals (e.g., hashtags), and E is the set of relations between users in V . Additionally, the function linkType : E -> L returns the annotation from L that relates two users. User X links to user Y by edge of type Z iff a) X follows Y (in the common sense of Twitter) and b) both X and Y have recently used the hashtag Z in their own postings/tweets. The authors describe the data collection and transformation processes for Twitter.
(Choudhury 2010) defines diffusion and prediction on social media.
(Altshuler 2012) analyzes trend prediction.
In (Guille 2013) it is studied the information diffusion and prediction. The author surveys the research done, defines a new graph model: Time Based Asynchronous Independent Cascades (T-BaSIC) and provides an open source implementation SOcial Network DYnamics (SONDY). Definitions are provided for message topics, social influence, heard behaviour and informational cascade. A probability of information transmission between nodes is used in the model. Due to “closed world assumption” the model underestimates the volume.
(Jain 2014) studies the harmonic analysis of the co-occurence matrix for short text messages. It uses diffusion wavelets for tweet classification.
(Li 2012) adresses the Named Entity Recognition problem. It uses Wikipedia and Web N-Gram, tweet segmentation.
(Nurwidyantoro 2013) surveys Event Detection: disaster, traffic, outbreak, and news.
(Pohl 2013) uses clustering approaches for sub-event detection related to the topic detection and tracking.
(Gao 2013) describes a novel method for geographical social event detection in social media.
(Ramanathan 2014) “Social Role Recognition for Human Event Understanding”
(Zhou 2013) event detection
(Weng 2011) event detection
(Costa 2010) describes how the Discret Fourier Transform (DFT) can be used for Information Retrieval (IR). A query is represented as a sum of individual query terms repre