Authorea

Looking at the results of our ROUGE tests, we find:

Both HDP models score between our lower and upper baselines. This is expected and supports the validity of our approach.
The F-scores for the variational method were all slightly higher than those of the GB model. Since the motivation for VI was its speed, it is encouraging that it performs at least comparably.
The relative performance of the different ROUGE metrics did not appear to depend on the classifier.
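
For readers unfamiliar with the F-scores compared above, the following is a minimal sketch of how a ROUGE-N precision, recall, and F-score can be computed. It assumes lowercased whitespace tokenization and a single reference summary. The names `ngrams` and `rouge_n` are illustrative only, and this is not the evaluation code that produced the results discussed here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f_score) for ROUGE-N.

    Assumption: simple lowercased whitespace tokenization and one reference.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)

    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    cand_total = sum(cand.values())
    ref_total = sum(ref.values())

    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    candidate = "the cat sat on the mat"
    reference = "the cat lay on the mat"
    for n in (1, 2):
        p, r, f = rouge_n(candidate, reference, n)
        print(f"ROUGE-{n}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

Because the F-score is the harmonic mean of precision and recall, a system can only score well by matching the reference without padding its summary, which is why it is a convenient single number for comparing models.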