Evaluation

Test Methodology

  • Data Sampling

The number of tweets per hashtag in the streaming data is far from uniform, ranging from roughly 300 to 5000, and such a skewed distribution can significantly alter the learning performance of certain classifiers. I therefore first restricted the data to the 8 largest hashtag groups, each containing more than 1000 tweets, and then shuffled all of these examples into a random order.
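A minimal sketch of this sampling step, assuming the collected tweets live in a pandas DataFrame with hypothetical columns 'hashtag' and 'text' (the actual column names and storage format are not specified in the report):

```python
import pandas as pd

def sample_top_groups(df: pd.DataFrame, n_groups: int = 8, min_size: int = 1000) -> pd.DataFrame:
    # Count tweets per hashtag and keep the n_groups largest groups above min_size.
    counts = df['hashtag'].value_counts()
    keep = counts[counts > min_size].head(n_groups).index
    subset = df[df['hashtag'].isin(keep)]
    # Shuffle the rows so the classes appear in random order rather than grouped.
    return subset.sample(frac=1, random_state=42).reset_index(drop=True)
```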

  • Tuning text preprocessing

When preprocessing the text, the obvious candidate steps are removing stopwords and stemming the original text. Stopwords are commonly occurring words such as “this”, “that”, “and”, “so”, and “on”. Whether removing them actually helps is not obvious, so it needs to be checked empirically: I therefore first compared performance with and without preprocessing (stopword removal and stemming).
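A minimal sketch of this optional preprocessing step, assuming NLTK is available; the plain whitespace tokenization here is an assumption, not necessarily the tokenizer used in the experiments:

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words('english'))  # requires nltk.download('stopwords')
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    tokens = text.lower().split()
    # Drop common function words, then reduce each remaining token to its stem.
    kept = [stemmer.stem(t) for t in tokens if t not in STOPWORDS]
    return ' '.join(kept)
```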

  • Tuning # of Features \(K\)

Generally, there are three main approaches to feature selection: mutual-information based, chi-square based, and frequency based [1]. Scikit-Learn offers two main univariate feature selection tools: SelectPercentile and SelectKBest. The difference is apparent from the names: SelectPercentile keeps the X% most powerful features (where X is a parameter), while SelectKBest keeps the K most powerful features (where K is a parameter). Scikit-Learn also provides scoring functions based on chi-squared and ANOVA F-values for classification. I computed the accuracies over a range of feature sizes for the different classifiers, using SelectKBest with the ANOVA F measure.
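A minimal sketch of the feature-count sweep, assuming bag-of-words features from CountVectorizer and using Naive Bayes as a stand-in classifier; the grid of K values is illustrative, not the grid used in the experiments:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def accuracy_vs_k(texts, labels, k_values=(100, 500, 1000, 5000)):
    scores = {}
    for k in k_values:
        pipe = Pipeline([
            ('vec', CountVectorizer()),
            ('select', SelectKBest(f_classif, k=k)),  # keep the K features with the highest ANOVA F-values
            ('clf', MultinomialNB()),
        ])
        scores[k] = cross_val_score(pipe, texts, labels, cv=10).mean()
    return scores
```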

  • Validation

Validation is a cornerstone of machine learning, because the goal is generalization to unseen test examples. Usually the only sensible way to assess how a model generalizes is validation: either a single training/validation split if there are enough examples, or cross-validation, which is more computationally expensive but a necessity when training points are few. My first step was to split the collected data: from the roughly 15k collected examples, I took 5k for testing and kept 10k for training. To guard against over-fitting, all experiments were performed using 10-fold cross-validation.
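A minimal sketch of the split and cross-validation setup, assuming a feature matrix X and label vector y have already been built and that `pipe` stands in for whichever classifier pipeline is being evaluated (both names are placeholders):

```python
from sklearn.model_selection import train_test_split, cross_val_score

# Hold out roughly one third of the ~15k examples for final testing (~5k),
# leaving ~10k for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42, stratify=y)

# 10-fold cross-validation on the training portion only.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=10)
print(cv_scores.mean(), cv_scores.std())
```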

Measurement Metrics

In this section, I present the results of my experiments. I start with a table of precision, recall, and accuracy for four different classifiers (Table 1): Naive Bayes, KNN, SVM, and Random Forest, together with a confusion matrix (Figure 2) showing the most frequently confused hashtags for the best-performing classifier. Finally, I show how performance changes as the number of selected features increases.

Table 1: Performance of Classifiers

Classifier                     Precision  Recall  Accuracy
Naive Bayes                    0.83       0.64    0.636
Naive Bayes w/o stopwords      0.83       0.66    0.660
KNN                            0.71       0.69    0.685
KNN w/o stopwords              0.73       0.69    0.694
SVM                            0.82       0.81    0.813
SVM w/o stopwords              0.85       0.83    0.815
Random Forest                  0.82       0.81    0.805
Random Forest w/o stopwords    0.85       0.83    0.818
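A minimal sketch of how the table's numbers and the confusion matrix can be computed with scikit-learn's metric helpers, assuming a fitted classifier `clf` and the held-out test split from the validation step; the macro averaging shown here is an assumption, since the report does not state which averaging was used:

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score, confusion_matrix

y_pred = clf.predict(X_test)

# Macro-averaging treats every hashtag class equally regardless of its size.
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
accuracy = accuracy_score(y_test, y_pred)

# Rows are true hashtags, columns are predicted hashtags; large off-diagonal
# entries mark the most frequently confused pairs.
cm = confusion_matrix(y_test, y_pred)
```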

  [1] http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html