We implement a standard averaged perceptron model, as documented in the assignment specification.
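The averaged perceptron can be sketched briefly. This is a minimal binary version for illustration, not the submitted implementation; the assumption is the usual formulation where the returned weights are the average of the weight vector over every update step, which tends to generalise better than the final weights alone.

```python
import numpy as np

def train_averaged_perceptron(X, y, n_passes=1):
    """Binary averaged perceptron; labels y must be in {-1, +1}.

    Returns the averaged weight vector and bias.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    w_sum = np.zeros(n_features)  # running sum of weights over all steps
    b_sum = 0.0
    steps = 0
    for _ in range(n_passes):
        for i in range(n_samples):
            if y[i] * (X[i] @ w + b) <= 0:  # mistake: standard perceptron update
                w += y[i] * X[i]
                b += y[i]
            w_sum += w
            b_sum += b
            steps += 1
    return w_sum / steps, b_sum / steps

def predict(w, b, X):
    return np.where(X @ w + b > 0, 1, -1)
```

A multiclass version would keep one weight vector per class and predict the argmax score, but the averaging idea is the same.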

We also repeated several of the experiments in this analysis with a multinomial Naive Bayes classifier. As we were encouraged to use the Perceptron model, and given the limited space available, the results of this comparison have not been included. It is worth noting that the Naive Bayes model appeared to perform no worse than the Perceptron in terms of accuracy, and often beat it. It was also very competitive in terms of run-time.

Validation Dataset

We validated our model on the Internet Advertisements Data Set. With a single pass we achieved a 10-fold average accuracy of 0.949, which is reasonably competitive with other benchmarks on this dataset (see http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements for benchmarks), so we are satisfied with our implementation. The best accuracy (0.951) was found by grid-searching on the number of training iterations, with the best value being three passes through the data.
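The validation procedure can be sketched as a loop over pass counts, scoring each setting by 10-fold mean accuracy. This sketch uses scikit-learn's `Perceptron` (which is not averaged, so it is a stand-in for our model) and synthetic data in place of the real features; `tol=None` forces it to run the requested number of passes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Internet Advertisements features
# (assumption: the real data would be loaded into X, y instead).
X, y = make_classification(n_samples=500, n_features=50, random_state=0)

# Grid-search the number of training passes by 10-fold mean accuracy.
best_passes, best_acc = None, -1.0
for n_passes in (1, 2, 3, 4, 5):
    clf = Perceptron(max_iter=n_passes, tol=None, random_state=0)
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    if acc > best_acc:
        best_passes, best_acc = n_passes, acc
```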


Wikipedia Text Classification


All 'mean' values are evaluated as the average model accuracy across all ten folds. This is equivalent to using an F1-measure with micro-averaging. Our experiments can be divided into bag-of-words approaches, semantic approaches and combined approaches. Bag-of-words (BOW) approaches include both the standard BOW model and the TF-IDF transformation. Our semantic approaches attempted to use POS taggers and polarity/objectivity measures to predict the text class. The combined approaches took two or more different feature sets and combined them, by changing the weighting of each feature set and using \(\chi^2\) feature selection/reduction.
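The combined approach can be sketched with scikit-learn's `FeatureUnion` (per-set weights) followed by \(\chi^2\) selection. The toy corpus, the weights and `k` are placeholders for illustration; in practice these values would come from the grid search.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import Perceptron
from sklearn.pipeline import FeatureUnion, Pipeline

docs = ["the cat sat on the mat", "dogs chase cats",
        "stocks fell sharply", "markets rallied today"]
labels = [0, 0, 1, 1]

# Combine two feature sets with per-set weights, then keep the k features
# with the strongest chi^2 association with the class label.
combined = Pipeline([
    ("features", FeatureUnion(
        [("bow", CountVectorizer()), ("tfidf", TfidfVectorizer())],
        transformer_weights={"bow": 1.0, "tfidf": 2.0},
    )),
    ("select", SelectKBest(chi2, k=10)),
    ("clf", Perceptron(random_state=0)),
])
combined.fit(docs, labels)
```

Note that \(\chi^2\) selection requires non-negative feature values, which both BOW counts and TF-IDF weights satisfy.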

Bag of Words Approaches (Figures 1,2,3)

A feature vector in the BOW model simply measures the frequency of words appearing in the document. The TF-IDF transformation takes this BOW vector and decreases the weighting of words which occur frequently across the corpus.
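The two representations can be illustrated in a few lines with scikit-learn, on a toy corpus (an assumption for illustration only): a term such as "apple" that appears in every document keeps its raw count in the BOW vector but is downweighted relative to rarer terms after the TF-IDF transformation.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["apple banana apple", "apple cherry", "apple banana cherry"]

counts = CountVectorizer().fit_transform(docs)    # raw BOW term frequencies
tfidf = TfidfTransformer().fit_transform(counts)  # downweight corpus-wide terms
```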

We perform experiments on both the full paragraph of the Wikipedia pages as well as the textual content of the 'categories' section. We optimise the model through a grid-search on the following parameters:

  • Max DF: The document-frequency threshold above which terms are excluded from the model.
  • Stopwords: Whether or not stopwords are removed. Scikit-Learn's stopword removal library was used.
  • N-grams: Whether 1-grams, 2-grams or 3-grams were used in the model (or a combination thereof).
  • Max Features: A cap on the total number of features used. If the vocabulary of our dataset exceeds this cap, the least frequently occurring words are removed.
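This grid search maps directly onto scikit-learn's `GridSearchCV` over `CountVectorizer` parameters. The corpus below is a toy stand-in for the Wikipedia text, and the grid values are illustrative rather than the full grid we searched.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (assumption: the real experiments used the
# Wikipedia paragraph/category text).
docs = ["cats purr softly", "kittens chase yarn",
        "cats nap often", "feline whiskers twitch",
        "stocks fell sharply", "markets rallied today",
        "investors bought shares", "trading volume surged"]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("vect", CountVectorizer()),
                 ("clf", Perceptron(random_state=0))])

# The four parameters from the text: Max DF, stopword removal,
# n-gram range, and the cap on the number of features.
param_grid = {
    "vect__max_df": [0.5, 1.0],
    "vect__stop_words": [None, "english"],
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__max_features": [None, 20],
}
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(docs, labels)
```

With the real data, `cv=10` would reproduce the ten-fold averages reported below.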


The best BOW model used the full paragraph text and achieved an accuracy of 0.738. It had a Max DF of 0.5, did not use stop-word removal, used both 1- and 2-grams, and did not cap the number of features. Among TF-IDF models the best result also used the full paragraph text, with an accuracy of 0.760. The model parameters were the same as for the BOW model, except that optimal performance came from capping the number of features at twenty thousand.

Other relevant results:

  • Adding bi-grams provided a uniform increase in accuracy.
  • Using the full paragraph text rather than only the categories text increased accuracy across our experiments.
  • The optimal Max. Df. value varied with other experiment parameters, ranging between 0.5 and 1.
  • Stopword removal was useful only when we limited the maximum number of features.
  • A TF-IDF transformation improved performance across the board.