Authorea

Mircea Trifan edited Introduction.tex about 10 years ago

Commit id: 9cb11f6664e827f5a54b50aa3a139473b3e1aac3


\section{Introduction}
There are four types of Twitter streams that an ordinary user has access to: trends; search phrases, which return at most 1500 tweets; user timelines and streams parametrized by keywords or users; and the spritzer stream, which carries 10\% of all tweets. These can be implemented as tab panels in a spreadsheet-like user interface.

Big data processing can be integrated in M3Data. The underlying database is Apache Accumulo, and the processing could be done with a pipeline approach in Cascading, which can run on top of Accumulo (or Storm).

M3Data blocks can be made for Twitter named entity recognition (NER) on the streams defined above. Other processing blocks can be built for the following named entity categories: users, prefixed by the @ sign; topics, identified by the hash sign; web pages, prefixed by http://; YouTube videos; stock market companies, prefixed by \$; retweets, marked RT; OpenNLP entities (people, places, dates); Freebase entities; news; and schema.org entities. A regex block could make the implementation easier.

A co-occurrence matrix of entities can be computed by Cascading on Accumulo and presented in the spreadsheet interface in all four stream tabs identified previously. Another M3Data block could compute tf-idf for concepts. For a corpus of existing tweets, the Twitter2011 or 2012 TREC corpus can be used.
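The regex block mentioned above could look roughly like the following Python sketch. It is an illustration under stated assumptions, not the M3Data implementation: the pattern set, the category names, and the function \texttt{extract\_entities} are all hypothetical, and the patterns are deliberately simple (for example, trailing punctuation after a URL is not stripped).

```python
import re

# Hypothetical patterns, one per surface-level entity category named in the text.
PATTERNS = {
    "user": re.compile(r"(?<!\w)@(\w+)"),                   # @mentions
    "hashtag": re.compile(r"(?<!\w)#(\w+)"),                # topics marked by the hash sign
    "url": re.compile(r"https?://\S+"),                     # web pages
    "youtube": re.compile(
        r"https?://(?:www\.)?(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]+)"
    ),                                                      # YouTube video ids
    "cashtag": re.compile(r"(?<!\w)\$([A-Za-z]{1,6})\b"),   # stock symbols
    "retweet": re.compile(r"\bRT\b"),                       # retweet marker
}


def extract_entities(tweet):
    """Return a dict mapping each category to the list of matches in the tweet."""
    return {name: pattern.findall(tweet) for name, pattern in PATTERNS.items()}


sample = "RT @alice: $AAPL up today #stocks http://example.com/x"
entities = extract_entities(sample)
```

People, places, and dates are left out on purpose: they need a statistical tagger such as OpenNLP, not regular expressions.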
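To make the co-occurrence matrix concrete, here is a minimal in-memory sketch in Python. It assumes each tweet has already been reduced to a list of entity strings. The function name \texttt{cooccurrence} and the sample data are hypothetical. A Cascading job on Accumulo would compute the same pair counts in a distributed way; that pipeline is not shown here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(entity_lists):
    """Count how often each unordered pair of entities appears in the same tweet."""
    counts = Counter()
    for entities in entity_lists:
        # Deduplicate within a tweet, then sort so (a, b) and (b, a) share one key.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample: one list of extracted entities per tweet.
tweets = [
    ["@alice", "#stocks", "$AAPL"],
    ["@bob", "#stocks", "$AAPL"],
    ["@alice", "#news"],
]
matrix = cooccurrence(tweets)
```

Storing only the sorted pairs amounts to keeping the upper triangle of a symmetric matrix, which suits a sparse key-value store.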
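The tf-idf block can be sketched in the same spirit. The exact definition the text has in mind is not given, so the sketch below assumes one common variant: term frequency normalized by document length, multiplied by $\log(N / \mathit{df})$, where $N$ is the number of documents and $\mathit{df}$ the number of documents containing the term. The function name \texttt{tf\_idf} and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical sample: each document is the list of concepts found in one tweet.
docs = [["apple", "stocks"], ["apple", "news"]]
weights = tf_idf(docs)
```

With this variant, a concept that occurs in every document gets weight zero, so it drops out of any ranking.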

Further options include Cascading in M3Data through Lingual, machine learning through PMML, an OLAP cube, and UIMA.