Comparing the kappa statistic with the percentage of correctly classified instances for the LibSVM classifier illustrates the problem of unbalanced data. Before preprocessing, LibSVM classifies 74.6% of messages correctly, but this figure merely reflects the class distribution of the underlying data: as the confusion matrix shows, only 17 messages are classified as bullying and the rest as not bullying. The percentage of correctly classified messages is therefore a poor measure of a classifier's usefulness. The effect of preprocessing is better captured by the kappa statistic. Without SMOTE, the kappa value for LibSVM is only 0.0255, indicating that the classifier performs barely better than one that assigns labels at random. After preprocessing, all classifiers improve. Also of note is that IBk achieves a very low proportion of false negatives at the cost of a higher proportion of false positives, which is valuable when the goal is to miss as few bullying messages as possible. An alternative is a cost-sensitive classifier that assigns a higher cost to misclassifying a positive instance; we chose to focus on only one of these methods.
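To make the contrast concrete, Cohen's kappa can be computed directly from a confusion matrix as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance given the marginal totals. The following sketch uses purely illustrative counts (the full matrix is not reproduced here), chosen so that accuracy is high while kappa stays near zero, mirroring the majority-class behaviour described above:

```python
def kappa_and_accuracy(cm):
    """Compute accuracy and Cohen's kappa from a square confusion matrix.

    cm[i][j] = number of instances of true class i predicted as class j.
    """
    k = len(cm)
    n = sum(sum(row) for row in cm)
    # Observed agreement: fraction of instances on the diagonal.
    p_o = sum(cm[i][i] for i in range(k)) / n
    # Chance agreement: product of row and column marginals per class.
    row_totals = [sum(r) for r in cm]
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return p_o, (p_o - p_e) / (1 - p_e)

# Hypothetical unbalanced matrix: rows = true class (not bullying, bullying),
# columns = predicted class. Nearly everything lands in the majority class.
cm = [[740, 10],
      [244, 6]]
acc, kappa = kappa_and_accuracy(cm)
# acc is about 0.746, yet kappa stays close to zero
```

A classifier that labelled every message as not bullying would score p_o equal to the majority-class proportion and a kappa of exactly zero, which is why kappa, unlike raw accuracy, exposes the weakness of such a classifier.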