Big Data Digital Signal Processing on Social Networks Graphs


Twitter-related problems:

  1. Named Entity Recognition (NER) (Li 2012), (Caputo 2009)

  2. Relation Extraction (Wang 2011)

  3. Classification (Jain 2014)

  4. Event Detection (Dong 2014), (Nurwidyantoro 2013), (Pohl 2013), (Gao 2013)

  5. Event Tracking (Pohl 2013)

  6. Geo search and visualization (Gao 2013)

  7. Recommender (Arru 2013), (Costa 2010)

  8. Trend Mining (Desmier 2013)

  9. Diffusion of topics (Guille 2013), (Altshuler 2012), (Choudhury 2010)

  10. Prediction (Guille 2013), (Symeonidis 2013), (Altshuler 2012), (Sizov 2010), (Choudhury 2010)

  11. Emergence (Miller 2013), (Jain 2014)

Theoretical aspects:

  1. Attributed Graph Model (Miller 2013), (Kim 2010)

  2. Residual Analysis of Attributed Graphs (Miller 2013)

  3. Sub-graph matching (Miller 2013), (Kriege 2012)

  4. Diffusion wavelets transform (Wang 2009), (Jain 2014)

  5. Detection Theory (Miller 2013)

  6. Complex networks

  7. Spectral analysis of graphs

  8. Signal Processing

  9. Predictive Analytics

  10. Tensors (Miller 2013)

Problems addressed in the framework of DSP on graphs:

  1. Mathematical model of Twitter as a dynamic attributed graph with streams attached to vertices.

  2. Subgraph matching.

  3. Twitter diffusion of topics or hash terms.

  4. Tweet classification.

  5. Querying a corpus of dependency parses of sentences viewed as graphs. (Miller 2012)

  6. Integrated search of documents, multimedia archives and geographic data.

  7. Detection Theory on Twitter graphs.

  8. Recommender Systems. (Arru 2013) + (Li 2012) NER

  9. Event Detection. (Dong 2014)

  10. Event Prediction

  11. Event Summarization

  12. Event Association

  13. Early Warning System

  14. Big Data Implementation.

Big Data Application Layer for Graphs:

  1. Search/Query

    1. Graph Analytics

    2. PageRank

    3. Subgraph Detection

    4. Belief Propagation

    5. Clustering/Classification


Multiuser selection

Remove high frequency words as in Lucene
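
A minimal sketch of the high-frequency word removal noted above. It does not use Lucene; a document-frequency threshold is assumed here as a stand-in for a stop list, and the function name and the `max_doc_ratio` parameter are hypothetical.

```python
from collections import Counter


def remove_high_frequency_terms(docs, max_doc_ratio=0.5):
    """Drop terms that occur in more than max_doc_ratio of the documents."""
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    limit = max_doc_ratio * len(tokenized)
    frequent = {term for term, df in doc_freq.items() if df > limit}
    filtered = [[t for t in tokens if t not in frequent] for tokens in tokenized]
    return filtered, frequent


docs = ["the storm hit the coast", "the market fell", "the coast guard responded"]
filtered, frequent = remove_high_frequency_terms(docs, max_doc_ratio=0.7)
```

With these three sample documents only "the" exceeds the threshold, so the remaining tokens are left for indexing.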

Adapteva or GPU/MATLAB. Use data from SemEval 2014 for sentence semantic relatedness. Treat dependency-parse links as Walsh codes, and capture the relation between words as a vector (word2vec). Unified search over RDF graphs and unstructured text. Use an iconic environment (M3Data or Simulink). Study the Deeplearning4j Twitter application. To draw dependency graphs: DependenSee, a dependency parse visualisation tool by Awais Athar that makes pictures of Stanford Dependency output. Form a document signal by concatenating the signals associated with its sentences. The source of a dependency graph link is encoded by a Walsh code and the destination by the code obtained by a rotation of -90 degrees. Question answering as a decoding-encoding problem or filter(docs)/Fourier(search). Apply to multimedia annotation or unified searching, encryption and watermarking.
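
The Walsh-code encoding of dependency links can be sketched as follows. Several choices here are assumptions, not part of the notes: each vocabulary word gets one row of a Sylvester-ordered Hadamard matrix as its code, the "rotation of -90 degrees" is read as multiplication by -j, a sentence signal is the sum of its link signals, and all function names are hypothetical.

```python
def walsh_codes(order):
    """Rows of a 2**order x 2**order Hadamard matrix (Sylvester construction)."""
    rows = [[1]]
    for _ in range(order):
        rows = [r + r for r in rows] + [r + [-x for x in r] for r in rows]
    return rows


def encode_sentence(links, code_of):
    """Sum over links of code(source) + (-j) * code(destination)."""
    n = len(next(iter(code_of.values())))
    signal = [0j] * n
    for src, dst in links:
        for i in range(n):
            signal[i] += code_of[src][i] + (-1j) * code_of[dst][i]
    return signal


def concatenate(sentence_signals):
    """Document signal: sentence signals placed one after another."""
    return [x for signal in sentence_signals for x in signal]


def correlate(signal, code):
    """Project a sentence signal onto one word's code.

    Because the codes are orthogonal, the real part counts how often the word
    is a link source and the negated imaginary part how often it is a
    destination.
    """
    return sum(s * c for s, c in zip(signal, code)) / len(code)


vocabulary = ["sat", "cat", "mat", "on"]
code_of = dict(zip(vocabulary, walsh_codes(2)))

# Dependency links (head, dependent) for "cat sat on mat".
links = [("sat", "cat"), ("sat", "mat"), ("mat", "on")]
sentence_signal = encode_sentence(links, code_of)
document_signal = concatenate([sentence_signal, sentence_signal])
```

Under these assumptions, correlating `sentence_signal` with the code of "sat" recovers that it is the source of two links. The sketch recovers per-word roles only; recovering which source pairs with which destination would need a richer encoding than the notes describe.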

Jive search with relations represented as database tables in M3Data (Campbell 2013); community detection, leadership... implemented as M3Data big data (AROM) by Apache Crunch on top of Spark, with a collaborative interface by NoFlo.

Represent a topic graph (hashtags + NER) as in (Sizov 2010), and apply detection (Miller 2013) to identify emergence.
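
One possible reading of "apply detection" on a topic graph is a residual test: compare each observed edge weight with what a degree-based null model would expect and inspect the largest residuals. The sketch below assumes the expected weight is k_i * k_j / (2m), as in the modularity matrix; the cited work may use a different null model, and the sample edge weights are invented for illustration.

```python
from collections import Counter


def residuals(edges):
    """Observed edge weight minus expected weight k_i * k_j / (2m)."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    two_m = sum(degree.values())
    return {
        (a, b): weight - degree[a] * degree[b] / two_m
        for (a, b), weight in edges.items()
    }


# Hypothetical topic graph: edge weight = number of tweets mentioning both topics.
edges = {("#x", "#y"): 3, ("#x", "#z"): 1}
res = residuals(edges)
strongest = max(res, key=res.get)
```

A candidate emerging topic pair is the edge with the largest positive residual; a real detector would also need a threshold derived from the noise model.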

An ordinary user has access to four types of Twitter streams: trends; a search phrase, which returns at most 1500 tweets; the user timeline; and streams, either parametrized by keywords or users or the spritzer stream, which is 10% of all tweets. These can be implemented as tab panels in a spreadsheet-like user interface. One dimension of the spreadsheet is given by the trending entity and the other by the keyword category of the stream. There is a single Twitter stream given by all keywords in all categories, which is then classified (using a signal processing approach) into individual categories. Another tab can hold the co-occurrence matrix. Each cell contains the most tweeted entity. On a cell click, the top entities (the first 20, for example) are displayed in a popup along with the corresponding tweets, perhaps on the right side of the screen. Fuel UX datagrid and Twitter’s Bootstrap are used for the user interface.
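
The data behind such a spreadsheet can be sketched as a grid keyed by (keyword category, entity type). The categories, keywords and sample tweets below are invented, and plain keyword matching is assumed as a placeholder for the signal processing classifier mentioned above.

```python
import re
from collections import Counter, defaultdict

# Hypothetical keyword categories that parametrize the single stream.
CATEGORIES = {
    "sports": {"match", "goal"},
    "finance": {"stock", "market"},
}
ENTITY_PATTERNS = {
    "hashtag": re.compile(r"#\w+"),
    "user": re.compile(r"@\w+"),
}


def classify(text):
    """Assign a tweet to every category whose keywords it contains."""
    words = set(re.findall(r"\w+", text.lower()))
    return [name for name, keywords in CATEGORIES.items() if words & keywords]


def build_grid(tweets):
    """Count entities per (category, entity type) cell."""
    grid = defaultdict(Counter)
    for text in tweets:
        for category in classify(text):
            for entity_type, pattern in ENTITY_PATTERNS.items():
                grid[(category, entity_type)].update(
                    match.lower() for match in pattern.findall(text)
                )
    return grid


def cell_value(grid, category, entity_type, top_n=1):
    """Most tweeted entities of a cell; top_n=20 would feed the popup."""
    return grid[(category, entity_type)].most_common(top_n)


tweets = [
    "Great goal by @leo #worldcup",
    "What a match #worldcup",
    "Stock market dips #nasdaq @trader",
    "Late goal #derby",
]
grid = build_grid(tweets)
top_sports_hashtag = cell_value(grid, "sports", "hashtag")
```

Each cell shows `cell_value(..., top_n=1)`, and a click would request the same cell with a larger `top_n`.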

Big data processing can be integrated in M3Data. The underlying database is Apache Accumulo, and the processing could be done with a pipeline approach using Cascading. Cascading can run on top of Accumulo (or Storm). Other distributed stream-based systems are Apache S4 and Storm.

M3Data blocks can be made for the Twitter streams defined above. Other processing blocks can be built for the following named entity categories: users prefixed by the @ sign, topics identified by the hash (#) sign, web pages prefixed by http, YouTube videos, stock market companies prefixed by $, retweets (RT), OpenNLP entities (people, places, dates), Freebase entities, and news. A regex block could make the implementation easier.
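
A regex block for the surface-level categories could look roughly like this. The patterns are simplified assumptions (real tweets need more careful URL and cashtag handling), and the OpenNLP, Freebase and news categories are left out because they need models or lookups rather than regular expressions.

```python
import re

# Simplified, assumed patterns; one entry per entity category.
ENTITY_REGEX = {
    "retweet": re.compile(r"^RT\b"),
    "user": re.compile(r"@(\w+)"),
    "hashtag": re.compile(r"#(\w+)"),
    "cashtag": re.compile(r"\$([A-Za-z]{1,6})\b"),
    "url": re.compile(r"https?://\S+"),
    "youtube": re.compile(
        r"https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)[\w-]+"
    ),
}


def extract_entities(text):
    """Return the categories found in a tweet with their matches."""
    found = {}
    for name, pattern in ENTITY_REGEX.items():
        matches = pattern.findall(text)
        if matches:
            found[name] = matches
    return found


tweet = "RT @alice: $AAPL up today #stocks http://example.com/a https://youtu.be/abc123"
found = extract_entities(tweet)
```

Keeping the patterns in one table means a new category is a one-line addition, which is the main appeal of a generic regex block over one block per category.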

A co-occurrence matrix of entities can be computed by Cascading on Accumulo and presented in the spreadsheet interface in all four stream tabs identified previously. Another M3Data block could compute tf-idf for concepts as defined in (Arru 2013), which uses wavelets to implement a recommender system.
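
The two computations can be sketched in memory as follows; a Cascading job on Accumulo would distribute the same counting. The tf-idf variant (raw term frequency, idf = log(N / df)) is an assumption, since the notes do not fix one, and the wavelet step of (Arru 2013) is not shown.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_matrix(entity_sets):
    """Symmetric matrix counting tweets in which two entities appear together."""
    entities = sorted(set().union(*entity_sets))
    index = {entity: i for i, entity in enumerate(entities)}
    matrix = [[0] * len(entities) for _ in entities]
    for entity_set in entity_sets:
        for a, b in combinations(sorted(set(entity_set)), 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return entities, matrix


def tf_idf(docs):
    """Per-document scores: raw term frequency times log(N / document frequency)."""
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    return [
        {term: count * math.log(n / doc_freq[term]) for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Entities already extracted from three hypothetical tweets.
tweets_entities = [["#ai", "@bob"], ["#ai", "@bob", "$TSLA"], ["#ai"]]
entities, matrix = cooccurrence_matrix(tweets_entities)
scores = tf_idf(tweets_entities)
```

An entity that appears in every tweet, such as "#ai" here, gets a tf-idf score of zero, which is the behaviour wanted when ranking concepts for recommendation.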

For a corpus of existing tweets, the Twitter2011 or 2012 TREC corpus or the Edinburgh corpus can be used. The aim of the SNOW 2014 Data Challenge “is to automatically mine social streams to provide journalists with a set of headlines and complementary information that summarize the newsworthy topics for a number of timeslots (time intervals) of interest.”

Other ideas: (Dong 2014) JIP-like interface; as in (Costa 2010), represent streaming phrases as a sum of individual keyword terms modelled as sinusoidal curves, and study a feedback loop on keywords for the streaming API: for a set of trending keywords, identify a new set that is used in the next iteration to feed the system; semantic fields; complex event processing (CEP); identify text patterns between entities; unified search for entities; Cascading in M3Data (Lingual) + ML (PMML); OLAP-cube-like operations; UIMA; BigInsights; deep, dark web; geo data; combine M3Data security with Accumulo’s cell-based security; detection theory applied to streams; Cascading, Pattern and D4M on Accumulo; PMML on signal processing and spectral graphs; apply TextRazor-type rules to Twitter, Prolog or CHR; networks and SDN.