Supervised Classification

After being processed with the bag-of-words model, the features of a tweet can be represented as a vector \(x\), where \(x_{i} = 1\) or \(0\) indicates the presence or absence of the \(i\)th feature word. I then train and test five different supervised classifiers with 10-fold cross-validation.
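As a concrete illustration, such binary feature vectors can be built with scikit-learn’s CountVectorizer; this is a minimal sketch, and the example tweets are hypothetical placeholders rather than data from this work.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example tweets, not data from this work
tweets = ["great game tonight", "who won the game", "new phone review"]

# binary=True records presence/absence (1/0) of each vocabulary word
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(tweets)   # sparse 0/1 feature matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())
```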

  • Naive Bayes. I first used a naive Bayes model to determine the relevance of hashtags to an individual tweet (a minimal sketch is given after this list). By Bayes’ rule, the posterior probability of \(C_{i}\) given the feature vector (word presences) \(x_{1},...,x_{n}\) is:

    \[p(C_{i}|x_{1},...,x_{n}) = \frac{p(C_{i})\,p(x_{1}|C_{i})\cdots p(x_{n}|C_{i})}{p(x_{1},...,x_{n})}\]

    \(p(C_{i}|x_{1},...,x_{n})\) is the probability of using hashtag \(C_{i}\) given the vector of word presences; \(p(C_{i})\) is the ratio of the number of times hashtag \(C_{i}\) is used to the total number of tweets with hashtags; and the likelihoods \(p(x_{1}|C_{i}),...,p(x_{n}|C_{i})\) are estimated from the existing tweets with hashtags.

  • k-Nearest Neighbors (kNN). To decide whether a document \(d\) belongs to a category \(c\), kNN checks whether the \(k\) training documents most similar to \(d\) belong to \(c\); if the answer is positive for a sufficiently large proportion of them, a positive decision is made (see the sketch after this list).

  • Support Vector Machines (SVM). Given the high dimensionality of the feature vectors, which makes the data more likely to be linearly separable, I opted for linear classification. Since the feature vectors are sparse, I used scikit-learn’s implementation of the SVM classifier1, which supports sparse data representations, together with a one-against-all strategy to turn the SVM into a multi-class classifier (sketched after this list).

  • Random Forest. A random forest is an ensemble of randomly generated decision trees, each of which outputs a prediction (in this case a hashtag); the final prediction is obtained by aggregating the trees’ outputs, by majority vote for classification. Each decision tree is constructed from a random subset of the training data drawn by bootstrap sampling. The Random Forest algorithm is included in scikit-learn, and the parameter to tune is the number of estimators (trees); a sketch follows the list.

  • Multilayer Perceptron (MLP). Finally, I implemented my own multilayer perceptron classifier, a form of artificial neural network. It consists of a number of nodes grouped in fully interconnected layers, where each node maps its input values to an output value using an internal function; this way, each input variable is processed by one or more nodes. The outputs of the first layer become the inputs of the second, and so on, until the final layer produces the classification output, so that each variable contributes to the final classification decision with a learned weight. The weights between layers are learned using backpropagation (a from-scratch sketch is given after this list).
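
The sketches below illustrate each of the classifiers above; they are minimal illustrations under assumed placeholder data (a random 0/1 matrix standing in for the real tweet features), not the exact experimental setup. First, naive Bayes with 10-fold cross-validation; BernoulliNB is an assumed choice that matches the binary presence/absence features.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 50))   # 200 "tweets", 50-word vocabulary
y = rng.randint(0, 5, size=200)         # 5 candidate hashtag classes

nb = BernoulliNB()  # estimates p(C_i) and p(x_j|C_i) from training tweets
print(cross_val_score(nb, X, y, cv=10).mean())  # 10-fold cross-validation
```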
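
A corresponding sketch for kNN; k = 5 is an assumed value, since k is a parameter to choose.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 50))   # placeholder binary features
y = rng.randint(0, 5, size=200)         # placeholder hashtag labels

# label each tweet by the majority hashtag among its k most similar neighbors
knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X, y, cv=10).mean())
```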
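
For the SVM, the sketch below stores the features as a sparse matrix and wraps a linear SVM in an explicit one-against-all strategy, as described above; LinearSVC is an assumed choice among scikit-learn’s SVM implementations.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = csr_matrix(rng.randint(0, 2, size=(200, 50)))  # sparse 0/1 features
y = rng.randint(0, 5, size=200)                    # placeholder hashtag labels

# one binary linear SVM per hashtag class (one-against-all)
svm = OneVsRestClassifier(LinearSVC())
print(cross_val_score(svm, X, y, cv=10).mean())
```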
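
For the random forest, the number of estimators is the parameter tuned in the text; 100 below is an assumed value.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 50))   # placeholder binary features
y = rng.randint(0, 5, size=200)         # placeholder hashtag labels

# each tree is fit on a bootstrap sample; predictions are aggregated by vote
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(rf, X, y, cv=10).mean())
```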
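
Finally, a from-scratch sketch of a one-hidden-layer MLP trained with backpropagation, in the spirit of the classifier described above; the layer sizes, learning rate, epoch count, and random data are all assumptions, not the author’s actual implementation.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 50)).astype(float)  # binary word features
y = rng.randint(0, 5, size=200)                      # hashtag class ids
Y = np.eye(5)[y]                                     # one-hot targets

n_in, n_hid, n_out, lr = 50, 32, 5, 0.1
W1 = rng.randn(n_in, n_hid) * 0.1; b1 = np.zeros(n_hid)
W2 = rng.randn(n_hid, n_out) * 0.1; b2 = np.zeros(n_out)

for epoch in range(200):
    # forward pass: sigmoid hidden layer, softmax output layer
    H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))
    Z = H @ W2 + b2
    Z -= Z.max(axis=1, keepdims=True)            # numerical stability
    P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)

    # backward pass: gradients of the cross-entropy loss
    dZ = (P - Y) / len(X)
    dW2, db2 = H.T @ dZ, dZ.sum(axis=0)
    dH = dZ @ W2.T
    dZ1 = dH * H * (1 - H)                       # sigmoid derivative
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # gradient-descent update of the layer weights
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("training accuracy:", (P.argmax(axis=1) == y).mean())
```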


  1. http://scikit-learn.org/stable/modules/svm.html