Mircea Trifan edited Introduction.tex  about 10 years ago

Commit id: 9cb11f6664e827f5a54b50aa3a139473b3e1aac3


\section{Introduction}

There are four types of Twitter streams that an ordinary user has access to: trends; search by phrase, which returns at most 1500 tweets; the user timeline; and streaming, either parametrized by keywords or users or taken from the spritzer stream, which is 10\% of overall tweets. These can be implemented as tab panels in a spreadsheet-like user interface.

Big data processing can be integrated in M3Data. The underlying database is Apache Accumulo, and the processing could follow a pipeline approach using Cascading, which can run on top of Accumulo (or Storm).

M3Data blocks can be made for Twitter named entity recognition (NER) on the streams defined above. Processing blocks can be built for the following named entity categories: users prefixed by the \texttt{@} sign; topics identified by the hash sign (\texttt{\#}); web pages prefixed by \texttt{http://}; YouTube videos; stock market companies prefixed by the dollar sign (\texttt{\$}); retweets (RT); OpenNLP entities (people, places, dates); Freebase entities; news; and schema.org. A regex block could make implementation easier.

The co-occurrence matrix of entities can be computed by Cascading on Accumulo and presented in the spreadsheet interface in all four stream tabs identified previously. Another M3Data block can compute tf-idf for concepts, again with Cascading on Accumulo. For a corpus of existing tweets, the Twitter 2011 or 2012 TREC corpus can be used.
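
The two statistics named above can be made concrete as follows. This is only a sketch under stated assumptions: the notation ($T$, $E(t)$, $C_{ij}$, $N$, $\mathrm{tf}$, $\mathrm{df}$) is introduced here for illustration, and the logarithmic idf weighting is the common textbook variant, not necessarily the one the M3Data block would use. Let $T$ be the set of tweets in one stream tab and $E(t)$ the set of entities recognized in tweet $t$. One possible co-occurrence count for two distinct entities $e_i$ and $e_j$ is the number of tweets that mention both:
\[
  C_{ij} = \left| \{\, t \in T : e_i \in E(t) \ \wedge\ e_j \in E(t) \,\} \right|, \qquad i \neq j .
\]
With $N = |T|$, $\mathrm{tf}(c,t)$ the number of occurrences of concept $c$ in tweet $t$, and $\mathrm{df}(c)$ the number of tweets containing $c$ (assumed to be at least one), a standard tf-idf weight is
\[
  \mathrm{tfidf}(c,t) = \mathrm{tf}(c,t) \cdot \log \frac{N}{\mathrm{df}(c)} .
\]
Under this definition $C_{ij} = C_{ji}$, so only one triangle of the matrix needs to be stored, and a concept that occurs in every tweet receives weight zero.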

Cascading in M3Data (Lingual) + ML (PMML), OLAP cube, UIMA.