The simulated data are generated by MetaSim (Richter 2008). The genomes are fetched from the National Center for Biotechnology Information (NCBI) database (ftp://ftp.ncbi.nih.gov/genomes/all/) and the NCBI taxonomy is downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/. We generate 3 datasets with varying numbers of species and abundances.
|Dataset||Number of species||Number of reads||Abundance|
We compare the performance of DirichletCluster with MarkovBin and MetaCluster, which are the best-performing unsupervised binning tools for NGS short reads. For all three methods, we set the input type to paired-end. MarkovBin requires the number of species as input, so we set it to the true value for each dataset.
The performance of these binning methods is evaluated by precision, sensitivity, and the number of discovered species, following (Wang 2012). We also report the adjusted Rand index, as suggested by (Nguyen 2013).
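
To make these metrics concrete, the following is a minimal Python sketch of how they could be computed from per-read labels. It is an illustration under stated assumptions, not the evaluation code used in this work: the function names, the use of `None` to mark unassigned reads, and the toy labels are all hypothetical. Precision and sensitivity follow one definition commonly used in the binning literature (dominant species per cluster, and dominant cluster per species), which we assume matches (Wang 2012). The adjusted Rand index uses the standard contingency-table form.

```python
from collections import Counter
from math import comb


def precision_sensitivity(true_labels, pred_labels):
    """Precision and sensitivity of a binning result.

    Assumed definitions (not confirmed by the source text):
      precision   = sum over clusters of the largest species count in
                    that cluster, divided by the number of assigned reads
      sensitivity = sum over species of the largest cluster count for
                    that species, divided by the total number of reads
    A predicted label of None marks an unassigned read (hypothetical
    convention for this sketch).
    """
    assigned = [(t, p) for t, p in zip(true_labels, pred_labels) if p is not None]
    cells = Counter(assigned)  # (species, cluster) -> read count
    best_per_cluster = {}
    best_per_species = {}
    for (species, cluster), count in cells.items():
        best_per_cluster[cluster] = max(best_per_cluster.get(cluster, 0), count)
        best_per_species[species] = max(best_per_species.get(species, 0), count)
    precision = sum(best_per_cluster.values()) / len(assigned) if assigned else 0.0
    sensitivity = sum(best_per_species.values()) / len(true_labels) if true_labels else 0.0
    return precision, sensitivity


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same reads."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label lists must have the same length")
    total_pairs = comb(len(true_labels), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two reads: the partitions agree trivially
    # Number of read pairs placed together in both, in truth, and in prediction.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Toy example: six reads from two species, one read left unassigned.
    truth = ["A", "A", "A", "B", "B", "B"]
    bins = [0, 0, 1, 1, 1, None]
    print(precision_sensitivity(truth, bins))
    print(adjusted_rand_index(["A", "A", "B", "B"], [1, 1, 0, 0]))
```

The adjusted Rand index is invariant to how clusters are named, so it does not require matching cluster identifiers to species. In this sketch an unassigned read lowers sensitivity but not precision, since precision is computed over assigned reads only.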