Main Data History
Show Index Toggle 0 comments
  •  Quick Edit
  • Topical Evenlope of Tweets via Hashtag Classification


    With over 300 million registered users producing over 500 million tweets per day(150+ {Amazing} {Twitt...), Twitter has became one of the largest and most popular microblogging websites. To cope with the volume of information shared daily, Twitter has introduced hashtags, keywords prefaced with “#”, to help users categorize and search for tweets, which provide a way for a user to indicate the subject of a tweet in a way that is easy to search for as featured index. However, not all the hashtags are in a “good” format. For example, when a big event just happened, some of the hashtags were not represented in a uniform way or missed some keywords. To retrieve all the information related to this big event, it asks for classifying the tweets and reproducing relatively uniform hashtags. To made this task reachable, I made a simple assumption that a top trends tweet’s hashtag content is a good approximation of its total content(Rosa).

    In this paper I present an algorithm to learn the relationships between the literal content of a tweet and the types of hashtags that could accurately describe that content. To better classify each tweet, I need to overcome problems of text cleaning, dimensionality reduction, and multi-class categorization.

    Proposed Approach

    I firstly try to build an algorithm to exact the features of the tweets, which is a typical NLP approach using different text cleaning method. Then, I utilize the tweets with top trends(most popular hashtags) to develop a model with supervised learning, to classify the related top trends. Finally, to evaluate the capability of the approach, I use the true top trends tweets to verify the learning performance. One difficulty involved in this procefure is dimentionality reduction due to the fact that each word is an feature: whether it’s present in the document or not (0/1), or how many times it appears (an integer >= 0), then I applied various techniques towards this problem.