Introduction

With over 300 million registered users producing over 500 million tweets per day\cite{_twitterStats}, Twitter has become one of the largest and most popular microblogging websites. To cope with the volume of information shared daily, Twitter introduced hashtags, keywords prefaced with “#”, which let a user indicate the subject of a tweet and which serve as a searchable index for categorizing and finding tweets. However, not all hashtags are in a “good” format. For example, right after a major event, the hashtags that refer to it are often not written in a uniform way, or they omit some keywords. Retrieving all the information related to such an event therefore requires classifying the tweets and generating relatively uniform hashtags for them. To make this task tractable, I make the simplifying assumption that the hashtag content of a tweet in the top trends is a good approximation of its total content\cite{rosa_topical}.

In this paper I present an algorithm that learns the relationship between the literal content of a tweet and the types of hashtags that could accurately describe that content. To classify each tweet well, I need to address three problems: text cleaning, dimensionality reduction, and multi-class categorization.