Twitter-related problems:
Relation Extraction (Wang 2011)
Classification (Jain 2014)
Event Tracking (Pohl 2013)
Geo search and visualization (Gao 2013)
Trend Mining (Desmier 2013)
Residual Analysis of Attributed Graphs (Miller 2013)
Detection Theory (Miller 2013)
Spectral analysis of graphs
Tensors (Miller 2013)
Problems addressed in the framework of DSP on graphs:
Mathematical model of Twitter as a dynamic attributed graph with streams attached to vertices.
Twitter diffusion of topics or hash terms.
Querying a corpus of dependency parses of sentences viewed as graphs. (Miller 2012)
Integrated search of documents, multimedia archives and geographic data.
Detection Theory on Twitter graphs.
Event Detection. (Dong 2014)
Early Warning System
Big Data Implementation.
Big Data Application Layer for Graphs:
Remove high-frequency words, as in Lucene.
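A minimal sketch of this block, modelled on Lucene's StopFilter; the stop list below is a small English stop-word set used for illustration, not necessarily Lucene's exact default list.

```python
# Sketch of high-frequency (stop) word removal, analogous to Lucene's
# StopFilter. The stop list is an illustrative English stop-word set.
STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "but", "by",
              "for", "if", "in", "into", "is", "it", "no", "not", "of",
              "on", "or", "such", "that", "the", "their", "then",
              "there", "these", "they", "this", "to", "was", "will", "with"}

def remove_stop_words(text):
    """Lower-case, tokenize on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(remove_stop_words("The cat sat on the mat"))  # ['cat', 'sat', 'mat']
```

In a pipeline this block would run before building any term or co-occurrence statistics, so high-frequency function words do not dominate the counts.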
Adapteva or GPU/MATLAB (http://www.mathworks.com/discovery/gpu-signal-processing.html).
Use data from SemEval 2014 for sentence semantic relatedness.
Dependency-parsing-based links as Walsh codes; capture the relation between words expressed by a vector (word2vec).
Unified search over RDF graphs and unstructured text.
Use an iconic environment (M3Data or Simulink).
Study the deeplearning4j Twitter application.
To draw dependency graphs: DependenSee, a dependency parse visualisation tool by Awais Athar that makes pictures of Stanford Dependency output (http://nlp.stanford.edu/software/lex-parser.shtml#Sample).
Form a document signal by concatenating the signals associated with its sentences.
A dependency-graph link's source is encoded by a Walsh code and its destination by the code obtained by a rotation of -90 degrees.
Question answering as a decoding/encoding problem, or filter(docs)/Fourier(search).
Apply to multimedia annotation or unified search, encryption, and watermarking.
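A sketch of the Walsh-code link encoding described above. Walsh codes are taken as rows of a Sylvester-construction Hadamard matrix; the "-90 degree rotation" for the destination is interpreted here as multiplication by exp(-j*pi/2) = -1j on a complex signal, which is an assumption, not the note's definition.

```python
import numpy as np

# Sketch: encode dependency-graph links with Walsh codes. Each word index
# gets a row of an 8x8 Hadamard matrix (a Walsh code); a link's source
# uses its word's code directly, the destination uses the code "rotated
# by -90 degrees", interpreted here as multiplication by -1j (assumption).
H = np.array([[1]])
for _ in range(3):                     # Sylvester construction of H_8
    H = np.block([[H, H], [H, -H]])
WALSH = H                              # rows are mutually orthogonal +/-1 codes

def encode_link(src, dst):
    """Signal for one link: source code plus rotated destination code."""
    return WALSH[src].astype(complex) + (-1j) * WALSH[dst]

def document_signal(links):
    """Concatenate link signals, forming a sentence/document signal."""
    return np.concatenate([encode_link(s, d) for s, d in links])

sig = document_signal([(0, 1), (1, 2)])
# Decoding by correlation: the real part recovers the source code and the
# negated imaginary part recovers the destination code.
print(np.argmax(WALSH @ sig[:8].real))  # 0, the source word of link one
```

Because the Walsh codes are orthogonal, correlating against the code set separates superposed links, which is what makes the decoding/encoding view of question answering plausible.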
Jive search with relations represented as database tables in M3Data (Campbell 2013); community detection, leadership... implemented as M3Data big data (AROM) by Apache Crunch on top of Spark, with a collaborative interface by NoFlo.
There are five types of Twitter streams that an ordinary user has access to: trends; search by phrase, which returns at most 1500 tweets; user timeline; streams parametrized by keywords or users; and the spritzer stream, a sample of about 1% of all tweets. These can be implemented as tab panels in a spreadsheet-like user interface. One dimension of the spreadsheet is given by the trending entity and another by the stream for the keyword category. There is a single Twitter stream given by all keywords in all categories, which is then classified (using a signal processing approach) into the individual categories. Another tab can be the co-occurrence matrix. Each cell contains the most tweeted entity; on cell click, the first 20, for example, are displayed in a popup along with the corresponding tweets, perhaps on the right side of the screen. Fuel UX Datagrid and Twitter's Bootstrap are used for the user interface.
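A toy sketch of routing the single combined keyword stream into per-category tabs. The categories and keywords are invented for illustration, and the term-count score is a crude stand-in for the signal-processing classifier mentioned above.

```python
# Sketch: classify tweets from the combined keyword stream into category
# tabs. Categories/keywords are illustrative assumptions; the score is a
# simple keyword-hit count, not the signal-processing approach itself.
CATEGORIES = {
    "finance": {"stocks", "market", "earnings"},
    "sports":  {"match", "goal", "league"},
}

def classify(tweet):
    """Return the category whose keywords the tweet hits most, else None."""
    tokens = set(tweet.lower().split())
    scores = {cat: len(tokens & kws) for cat, kws in CATEGORIES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(classify("Market rallies on strong earnings"))  # finance
```

Each classified tweet would then be appended to the spreadsheet tab of its category.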
Big data processing can be integrated into M3Data. The underlying database is Apache Accumulo, and the processing could be done in a pipeline approach with Cascading, which can run on top of Accumulo (or Storm). Other distributed stream-based systems are Apache S4 and Storm.
M3Data blocks can be made for the Twitter streams defined above. Other processing blocks can be built for the following named-entity categories: users prefixed by the @ sign, topics identified by the hash (#) sign, web pages prefixed by http, YouTube videos, stock market companies prefixed by $, retweets (RT), OpenNLP entities (people, places, dates), Freebase entities, news, schema.org. A regex block could make the implementation easier.
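A sketch of the regex block for the simpler categories (@users, # topics, URLs, $ stock symbols, retweets). The patterns are illustrative, not exhaustive; the OpenNLP/Freebase categories would need real NLP blocks instead.

```python
import re

# Sketch of a regex block for the syntactically marked entity categories.
# Patterns are illustrative approximations, not production-grade.
PATTERNS = {
    "user":    re.compile(r"@(\w+)"),
    "topic":   re.compile(r"#(\w+)"),
    "url":     re.compile(r"(https?://\S+)"),
    "stock":   re.compile(r"\$([A-Za-z]{1,5})\b"),
    "retweet": re.compile(r"\bRT\b"),
}

def extract_entities(tweet):
    """Map each category to the list of matches found in the tweet."""
    return {cat: pat.findall(tweet) for cat, pat in PATTERNS.items()}

print(extract_entities("RT @alice check #graphs at http://example.com $AAPL"))
```

Wrapping this as a single parametrized block (category name + pattern) is what would make the M3Data implementation easier.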
A co-occurrence matrix of entities can be computed by Cascading on Accumulo and presented in the spreadsheet interface in all the stream tabs identified previously. Another M3Data block: tf-idf for concepts as defined in (Arru 2013), which uses wavelets to implement a recommender system.
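A small in-memory sketch of the co-occurrence computation that the Cascading-on-Accumulo job would perform at scale; here the "matrix" is a counter keyed by entity pairs, and two entities co-occur when they appear in the same tweet.

```python
from collections import Counter
from itertools import combinations

# Sketch of the entity co-occurrence matrix, stored sparsely as a Counter
# keyed by sorted entity pairs; a pair co-occurs once per shared tweet.
def cooccurrence(tweets_entities):
    """tweets_entities: a list of entity sets, one set per tweet."""
    counts = Counter()
    for entities in tweets_entities:
        for a, b in combinations(sorted(entities), 2):
            counts[(a, b)] += 1
    return counts

m = cooccurrence([{"#graphs", "@alice"}, {"#graphs", "@alice", "$AAPL"}])
print(m[("#graphs", "@alice")])  # 2
```

The spreadsheet tab would then render the top-counted pairs, with the click-through popup showing the underlying tweets.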
For a corpus of existing tweets, the TREC Twitter 2011 or 2012 corpus or the Edinburgh corpus can be used. The goal of the SNOW 2014 Data Challenge "is to automatically mine social streams to provide journalists with a set of headlines and complementary information that summarize the newsworthy topics for a number of timeslots (time intervals) of interest."
Other ideas:
(Dong 2014) JIP-like interface.
As in (Costa 2010), represent streaming phrases as a sum of individual keyword terms modelled as sinusoidal curves, and study a feedback loop on keywords for the streaming API: for a set of trending keywords, identify a new set that is used in the next iteration to feed the system.
Semantic fields.
Complex event processing (CEP).
Identify text patterns between entities.
Unified search for entities.
Cascading in M3Data (Lingual) + ML (PMML).
OLAP-cube-like operations.
UIMA; BigInsights.
Deep and dark web.
Geo data.
Combine M3Data security with Accumulo's cell-based security.
Detection theory applied on streams.
Cascading, Pattern, D4M on Accumulo; NoFlo.io.
PMML on signal processing and spectral graphs.
Apply TextRazor-type rules to Twitter; Prolog or CHR.
Networks and SDN.
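The keyword feedback loop can be sketched as follows. `fetch_tweets` is a hypothetical stub standing in for a real streaming-API client, and the canned tweets inside it are invented for illustration.

```python
from collections import Counter

# Sketch of the keyword feedback loop for the streaming API: each round's
# trending hashtags become the next round's stream filter.
def fetch_tweets(keywords):
    """Stub for a streaming-API client; returns canned matching tweets."""
    canned = {"graphs": ["#spectral analysis of #graphs", "#graphs rock"]}
    return [t for kw in keywords for t in canned.get(kw, [])]

def next_keywords(keywords, top_k=2):
    """Count hashtags in the fetched tweets; keep the top_k as the new filter."""
    tags = Counter(tok.lstrip("#")
                   for tweet in fetch_tweets(keywords)
                   for tok in tweet.split() if tok.startswith("#"))
    return [tag for tag, _ in tags.most_common(top_k)]

print(next_keywords(["graphs"]))  # ['graphs', 'spectral']
```

Iterating `next_keywords` closes the loop: the stream filter drifts toward whatever the current keyword set is surfacing, which is exactly the dynamic the note proposes to study.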