Topical Envelope of Tweets via Hashtag Classification


With over 300 million registered users producing more than 500 million tweets per day (150+ {Amazing} {Twitt...), Twitter has become one of the largest and most popular microblogging websites. To cope with the volume of information shared daily, Twitter introduced hashtags, keywords prefixed with “#”, which let users categorize and search for tweets by indicating a tweet's subject in an easily searchable, indexed form. However, not all hashtags are well formed. For example, when a major event has just happened, some of its hashtags are not written in a uniform way or omit keywords. Retrieving all the information related to such an event therefore requires classifying the tweets and reproducing relatively uniform hashtags. To make this task tractable, I adopt a simple assumption: the hashtag of a top-trend tweet is a good approximation of the tweet's overall content (Rosa).

In this paper I present an algorithm that learns the relationship between the literal content of a tweet and the hashtags that could accurately describe that content. To classify each tweet accurately, I must address the problems of text cleaning, dimensionality reduction, and multi-class categorization.

Proposed Approach

I first build an algorithm to extract features from the tweets, a typical NLP pipeline that applies several text cleaning methods. Then, using the tweets carrying top trends (the most popular hashtags), I train a supervised model to classify tweets into the related top trends. Finally, to evaluate the approach, I verify the learning performance on true top-trend tweets. One difficulty in this procedure is dimensionality reduction, since each word is a feature: either its presence in the document (0/1) or the number of times it appears (an integer >= 0). I therefore apply several techniques to this problem.
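The cleaning and feature-extraction step above can be sketched as follows. This is a minimal illustration, not the paper's exact pipeline: the specific cleaning rules (dropping URLs and mentions, stripping the “#” prefix) and the example tweets are assumptions, and the vector shows the binary 0/1 word-presence variant described above.

```python
import re

def clean_tweet(text):
    """Hypothetical cleaning: lowercase, drop URLs and @mentions,
    keep hashtag words as plain tokens, then tokenize on letters."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove user mentions
    text = text.replace("#", "")              # '#worldcup' -> 'worldcup'
    return re.findall(r"[a-z']+", text)

def bag_of_words(tokens, vocabulary):
    """Binary presence vector (0/1); swap in token counts for the
    integer >= 0 variant mentioned in the text."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

tweets = [
    "Check out #WorldCup highlights! http://t.co/x",
    "@fan the #WorldCup final was great",
]
tokenized = [clean_tweet(t) for t in tweets]
vocab = sorted({w for toks in tokenized for w in toks})
vectors = [bag_of_words(toks, vocab) for toks in tokenized]
```

Each tweet becomes a vector with one dimension per vocabulary word, which is exactly why the dimensionality grows with the corpus and motivates the reduction techniques discussed above.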

Data Collection

The dataset was constructed by requesting public tweets from the Twitter API. I collected more than 25,000 messages over one week and, considering the worldwide usage of Twitter, tweets were only kept if the user's language was set to English. Because users are free to define their own hashtags, a large number of distinct hashtags appears in the requested tweets, and a tremendous amount of spam is posted every minute. To narrow the number of classes, I decided to consider only the top-trend hashtags (To {Trend} or {Not} t...). All messages that did not contain at least one trend were discarded. In addition, the Twitter API updates its trends every five minutes, so my data collection also had to gather the corresponding tweets dynamically. In the end, about 15,000 distinct tweets covering 47 distinct trends were collected, and each tweet's trend hashtag is marked as its label for the subsequent classification task.
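The filtering and labeling steps above can be sketched as follows. This is a hedged illustration rather than the actual collection code: the trend set, the helper name `label_tweet`, and the example tweets are all hypothetical, and in practice the trend set would be refreshed from the API every five minutes.

```python
# Hypothetical current trends; in the real pipeline this set is
# re-fetched from the Twitter API every five minutes.
current_trends = {"#worldcup", "#oscars"}

def label_tweet(text, trends):
    """Return (text, trend) if the tweet carries a current trend,
    or None so the tweet is discarded, as described in the text."""
    hashtags = {w.lower().rstrip(".,!?") for w in text.split() if w.startswith("#")}
    matched = hashtags & trends
    if not matched:
        return None
    return text, sorted(matched)[0]  # one trend becomes the class label

raw = ["Goal! #WorldCup", "just had lunch", "red carpet #Oscars"]
labeled = [r for r in (label_tweet(t, current_trends) for t in raw) if r]
```

Tweets with no trending hashtag (like the second example) are dropped, and the matched trend becomes the label used later for supervised classification.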