The simulated data are generated by MetaSim (Richter 2008). The genomes are fetched from the National Center for Biotechnology Information (NCBI) database (ftp://ftp.ncbi.nih.gov/genomes/all/), and the NCBI taxonomy is downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/. We generate three datasets with varying numbers of species and abundances.
|Dataset||Number of species||Number of reads||Abundance|
We compare the performance of DirichletCluster with that of MarkovBin and MetaCluster, which are the best-performing unsupervised binning tools for NGS short reads. For all three methods, we set the input type to paired-end. MarkovBin requires the number of species as input, so we set it to the true value for each dataset.
The performance of these binning methods is evaluated on precision, sensitivity and the number of discovered species, following Wang (2012). We also report the adjusted Rand index, as suggested by Nguyen (2013).
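The four measures can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the experiments: it assumes the majority-vote definitions of precision and sensitivity that are common in the binning literature (the exact definitions in Wang 2012 may differ in detail), it counts a species as discovered when it is the majority species of at least one cluster, and it computes the adjusted Rand index over binned reads only. The function names and the use of `None` for an unbinned read are hypothetical conventions chosen for this example.

```python
from collections import Counter
from math import comb


def contingency(true_labels, cluster_labels):
    """Per-cluster species counts; reads whose cluster is None are unbinned."""
    table = {}
    for species, cluster in zip(true_labels, cluster_labels):
        if cluster is None:
            continue
        table.setdefault(cluster, Counter())[species] += 1
    return table


def precision(true_labels, cluster_labels):
    """Fraction of binned reads that belong to their cluster's majority species."""
    table = contingency(true_labels, cluster_labels)
    binned = sum(sum(counts.values()) for counts in table.values())
    return sum(max(counts.values()) for counts in table.values()) / binned


def sensitivity(true_labels, cluster_labels):
    """Fraction of all reads (binned or not) in the best cluster of their species."""
    table = contingency(true_labels, cluster_labels)
    best = Counter()
    for counts in table.values():
        for species, n in counts.items():
            best[species] = max(best[species], n)
    return sum(best.values()) / len(true_labels)


def discovered_species(true_labels, cluster_labels):
    """Number of distinct species that are the majority in at least one cluster."""
    table = contingency(true_labels, cluster_labels)
    return len({counts.most_common(1)[0][0] for counts in table.values()})


def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index, computed over binned reads only (an assumption)."""
    pairs = [(s, c) for s, c in zip(true_labels, cluster_labels) if c is not None]
    n = len(pairs)
    if n < 2:
        return 1.0
    sum_cells = sum(comb(v, 2) for v in Counter(pairs).values())
    sum_rows = sum(comb(v, 2) for v in Counter(s for s, _ in pairs).values())
    sum_cols = sum(comb(v, 2) for v in Counter(c for _, c in pairs).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy input: six reads from two species, one read left unbinned.
true_species = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, None]
print(precision(true_species, clusters))
print(sensitivity(true_species, clusters))
print(discovered_species(true_species, clusters))
print(adjusted_rand_index(true_species, clusters))
```

Under these definitions, splitting one species across many clusters leaves precision high but lowers sensitivity, while merging several species into one cluster lowers precision and the number of discovered species. The adjusted Rand index penalizes both kinds of error in a single score.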
As the results show, MetaCluster always achieves very high sensitivity, but its overall performance is limited by the number of species it discovers. Abundance strongly affects MetaCluster: it performs better on Datasets 1 and 3 (higher abundance, ~25x) than on Dataset 2 (lower abundance, ~12x). For MarkovBin, however, the results become unacceptable as the number of species increases.
DirichletCluster discovers most of the species in all three datasets. Lower abundance does have some effect, but precision and sensitivity are both acceptable on Dataset 2. The sensitivity of DirichletCluster is comparable to that of MetaCluster when the abundance is low or the number of species is small. When the number of species is large, its sensitivity is lower than that of MetaCluster, but its overall performance is much better.