The simulated data are generated with MetaSim (Richter 2008). The genomes are fetched from the National Center for Biotechnology Information (NCBI) database, and the NCBI taxonomy is downloaded as well. We generate three datasets that differ in number of species and abundance.

Basic information about the 3 datasets

Dataset     Number of species   Number of reads   Abundance
Dataset 1   4                   50000             26x
Dataset 2   9                   50000             12x
Dataset 3   19                  200000            25x

We compare the performance of DirichletCluster with that of MarkovBin and MetaCluster, which are the best-performing unsupervised binning tools for NGS short reads. For all three methods, we set the input type to paired-end. MarkovBin requires the number of species as input, so we supply the true value for each dataset.

The performance of these binning methods is evaluated by precision, sensitivity, and number of discovered species, following (Wang 2012). We also report the adjusted Rand index, as suggested by (Nguyen 2013).
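The measures named above can be sketched in a few lines of Python. This is an illustrative sketch only, not the evaluation code used for the experiments. It assumes the usual read-level definitions for binning: precision is the fraction of binned reads that belong to the dominant species of their bin, and sensitivity is the fraction of all reads (including unbinned ones) that fall in the best-matching bin of their species. The adjusted Rand index follows the standard pair-counting formula. The function names and the use of `None` for an unbinned read are hypothetical choices made for this example.

```python
from collections import Counter
from math import comb


def precision_sensitivity(true_species, bins):
    """Read-level precision and sensitivity (assumed definitions, see lead-in).

    true_species: species label of each read.
    bins: bin assigned to each read, or None if the read was left unbinned.
    """
    # Count reads for every (bin, species) pair, skipping unbinned reads.
    table = Counter()
    for species, b in zip(true_species, bins):
        if b is not None:
            table[(b, species)] += 1

    best_per_bin = {}      # reads of the dominant species in each bin
    best_per_species = {}  # reads of each species in its best-matching bin
    for (b, species), n in table.items():
        best_per_bin[b] = max(best_per_bin.get(b, 0), n)
        best_per_species[species] = max(best_per_species.get(species, 0), n)

    binned = sum(table.values())
    total = len(true_species)
    precision = sum(best_per_bin.values()) / binned if binned else 0.0
    sensitivity = sum(best_per_species.values()) / total if total else 0.0
    return precision, sensitivity


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same reads."""
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two reads: the partitions trivially agree

    # Pairs of reads that are grouped together in both, or in each, labeling.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are all singletons
    return (together_both - expected) / (maximum - expected)


if __name__ == "__main__":
    # Toy example: six reads from two species, one read left unbinned.
    truth = ["A", "A", "A", "B", "B", "B"]
    assigned = [0, 0, 1, 1, 1, None]
    print(precision_sensitivity(truth, assigned))
    print(adjusted_rand_index(["A", "A", "B", "B"], [0, 0, 1, 1]))
```

Under these assumed definitions, a tool that leaves many reads unbinned can still score high precision while losing sensitivity, and a tool that merges several species into one bin keeps sensitivity high while discovering fewer species. This is why the number of discovered species is reported next to the two rates.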

As we can see, MetaCluster always achieves very high sensitivity, but its overall performance is limited by the number of species it discovers. Abundance matters greatly to MetaCluster: it performs better on Datasets 1 and 3 (higher abundance, ~25x) than on Dataset 2 (lower abundance, ~12x). For MarkovBin, by contrast, the results become unacceptable as the number of species increases.

DirichletCluster discovers most species in all three datasets. Lower abundance does have some effect, but precision and sensitivity are both still acceptable on Dataset 2. The sensitivity of DirichletCluster is comparable to that of MetaCluster when the abundance is low or the number of species is small. When the number of species is large, its sensitivity is lower than that of MetaCluster, but its overall performance is much better.