Twitter Miner (Final Report)
The aim of our project is to collect Twitter streaming data and implement a series of data mining algorithms for three main objectives: event detection, sentiment analysis, and user categorization. By analyzing real-time events, we will survey user reactions and categorize users based on their interests.
Twitter is one of the fastest-growing micro-blogging platforms, with 320 million monthly active users. Every user can report events happening around him or her. Because of this, Twitter has become a rich source for detecting, monitoring, and analyzing real-time events such as natural disasters, health epidemics, political elections, sports matches, or the release of a new product.
In this work, we aim to detect events from Twitter and to conduct a sentiment analysis of users’ reactions to each detected event. To detect events, we will apply a method that clusters hashtags or certain keywords. Using the tweets available for a particular event, we will detect the emotions of users and categorize those users based on their common interests.
Following Sakaki et al., we search for specific keywords or hashtags to detect events. For event detection, we developed an application that generates a vector for each tweet using TF-IDF. We then clustered these vectors with the k-means algorithm.
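The vectorize-then-cluster step can be illustrated with a minimal Python sketch. It is not the application described above. The tokenizer, the smoothed IDF formula, the unit-length normalization, and the deterministic farthest-point choice of initial centroids are all assumptions made for illustration, and the names tokenize, tfidf_vectors, and kmeans are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase, split on whitespace, strip surrounding punctuation; '#' is kept.
    tokens = (t.strip(".,!?:;\"'") for t in text.lower().split())
    return [t for t in tokens if t]

def tfidf_vectors(docs):
    # Build one unit-length TF-IDF vector per document (smoothed IDF, an assumption).
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [(counts[t] / len(doc)) * idf[t] if doc else 0.0 for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iterations=50):
    # Deterministic farthest-point initialization (assumed here so runs are repeatable).
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

tweets = [
    "Earthquake felt in Ankara #earthquake",
    "Strong #earthquake shaking felt downtown",
    "Great music at the concert #concert",
    "The #concert music was amazing",
]
vocab, vectors = tfidf_vectors(tweets)
labels = kmeans(vectors, k=2)
```

In this toy input the two earthquake tweets share terms with each other but not with the concert tweets, so they end up in different clusters. On real streaming data, the number of clusters and the initialization strategy would need tuning.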
We will also analyze the feelings and reactions of users about the events found in the first operation. Emoticons and keywords will be our determining factors for emotions, as discussed in Hasan et al. Collected results will be classified as positive and/or negative.
For the third operation, we analyzed users’ tweets and tried to find a pattern for categorizing the users. For example, if the event is a music concert, we determined the users’ common interests from their previous tweets. Raúl et al. pursued a similar goal using the tweepy API. For this purpose, we developed a C# application that uses the Apriori algorithm to select frequent hashtags and that creates the training model and test file (in ARFF format). We then used this training model and test file in Weka to predict the user groups in tweets.
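The idea behind selecting frequent hashtags with Apriori and exporting them for Weka can be sketched as follows. This is an illustrative Python sketch, not the C# application itself. The function names apriori and to_arff, the tag_ attribute prefix, the binary {0,1} encoding, and the class labels are all hypothetical choices.

```python
from itertools import combinations

def apriori(transactions, min_support):
    # Return {itemset: count} for every itemset seen in at least min_support transactions.
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join frequent (k-1)-itemsets into k-itemset candidates, then prune any
        # candidate that has an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

def to_arff(relation, tags, rows, classes):
    # One binary attribute per hashtag plus a nominal class; "?" marks an unknown class.
    lines = ["@relation " + relation, ""]
    for tag in tags:
        lines.append("@attribute tag_%s {0,1}" % tag.lstrip("#"))
    lines.append("@attribute class {%s}" % ",".join(classes))
    lines += ["", "@data"]
    for hashtags, label in rows:
        flags = ["1" if tag in hashtags else "0" for tag in tags]
        lines.append(",".join(flags + [label]))
    return "\n".join(lines) + "\n"

# Each transaction is the set of hashtags used by one (hypothetical) user.
user_hashtags = [
    {"#music", "#concert"},
    {"#music", "#concert", "#rock"},
    {"#music", "#football"},
    {"#football", "#goal"},
]
frequent = apriori(user_hashtags, min_support=2)
arff = to_arff(
    "users",
    ["#music", "#football"],
    [({"#music"}, "music_fan"), ({"#football"}, "?")],
    ["music_fan", "sports_fan"],
)
```

Here each user's hashtag set is treated as one transaction. Rows with a known class would form the training file, and rows marked "?" the test file to be labeled in Weka.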
The main objective of phase one was to retrieve tweets from the Twitter API. We achieved this goal by dividing different approaches and third-party libraries among the project members, each of whom tried to implement a retrieval algorithm with them. Eventually we managed to retrieve data via the Twitterizer library. We successfully collected tweets posted from Ankara, together with the related user information and each user’s last twenty tweets and favorites. We plan to collect this information continuously and store it in XML files.
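As an illustration of the planned XML storage, the following Python sketch serializes a few tweet records with the standard library. The element and attribute names (tweets, tweet, id, user, location, text) and the sample record are hypothetical. The report does not specify the actual schema, and retrieval through Twitterizer is not shown.

```python
import xml.etree.ElementTree as ET

def tweets_to_xml(tweets):
    # Build a <tweets> document with one <tweet> element per record.
    root = ET.Element("tweets")
    for t in tweets:
        node = ET.SubElement(root, "tweet", id=str(t["id"]))
        ET.SubElement(node, "user").text = t["user"]
        ET.SubElement(node, "location").text = t["location"]
        ET.SubElement(node, "text").text = t["text"]
    # ElementTree escapes special characters such as '&' and '<' in text.
    return ET.tostring(root, encoding="unicode")

sample = [
    {"id": 1, "user": "example_user", "location": "Ankara",
     "text": "Traffic & rain in Ankara #weather"},
]
xml_text = tweets_to_xml(sample)
parsed = ET.fromstring(xml_text)
```

Parsing the output back, as in the last line, is a simple check that the stored file stays well-formed even when a tweet contains characters that are special in XML.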
For sentiment analysis, the text content of tweets will be processed and emoji characters will be extracted. We will parse the Unicode representation of each emoji character and categorize it as a positive or negative reaction.
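A minimal Python sketch of this emoji-based step is given below. The code point ranges used to recognize emoji are an approximation, and the small positive and negative sets are hypothetical examples rather than the lexicon used in the project. The "neutral" label for ties and for tweets without known emoji is also an assumption.

```python
# Hypothetical polarity sets, written as Unicode escapes.
POSITIVE = {
    "\U0001F600",  # grinning face
    "\U0001F60D",  # smiling face with heart-eyes
    "\U0001F44D",  # thumbs up
    "\u2764",      # heavy black heart
}
NEGATIVE = {
    "\U0001F622",  # crying face
    "\U0001F621",  # pouting face
    "\U0001F44E",  # thumbs down
    "\U0001F494",  # broken heart
}

# Approximate emoji blocks: pictographs and emoticons, plus symbols and dingbats.
EMOJI_RANGES = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF)]

def is_emoji(ch):
    code = ord(ch)
    return any(low <= code <= high for low, high in EMOJI_RANGES)

def extract_emoji(text):
    # Return the emoji characters of a tweet in their original order.
    return [ch for ch in text if is_emoji(ch)]

def classify_tweet(text):
    # Count known positive and negative emoji and compare the totals.
    emoji = extract_emoji(text)
    score = sum(e in POSITIVE for e in emoji) - sum(e in NEGATIVE for e in emoji)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, classify_tweet("Great match \U0001F600\U0001F44D") gives "positive", while a tweet with no recognized emoji falls back to "neutral". A fuller treatment would also handle multi-code-point emoji sequences, which this sketch ignores.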