Simulation

The simulated data are generated with MetaSim (Richter 2008). The genomes are fetched from the National Center for Biotechnology Information (NCBI) database (ftp://ftp.ncbi.nih.gov/genomes/all/), and the NCBI taxonomy is downloaded from ftp://ftp.ncbi.nih.gov/pub/taxonomy/. We generate three datasets that differ in the number of species and in abundance.

Basic information about the three datasets
Dataset     Number of species   Number of reads   Abundance
Dataset 1   4                   50000             26x
Dataset 2   9                   50000             12x
Dataset 3   19                  200000            25x

We compare the performance of DirichletCluster with that of MarkovBin and MetaCluster, which are the best among the unsupervised binning tools for NGS short reads. For all three methods, we set the input type to paired-end. MarkovBin requires the number of species as input, so we set it to the true value for each dataset.

The performance of these binning methods is evaluated by precision, sensitivity, and the number of discovered species, following (Wang 2012). We also report the adjusted Rand index, as suggested by (Nguyen 2013).
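
As an illustration of these measures, the following is a minimal sketch, not the evaluation code used in this work. It assumes that every read carries a true species label and a predicted bin label, that no read is left unassigned, and that precision and sensitivity are defined by majority overlap (each bin is credited with its most frequent species, and each species with the bin that holds most of its reads). The function names and the toy labels are hypothetical. The adjusted Rand index follows the standard pair-counting formula.

```python
from collections import Counter
from math import comb


def contingency(truth, pred):
    """Count reads for every (true species, predicted bin) pair."""
    return Counter(zip(truth, pred))


def precision_sensitivity(truth, pred):
    """Majority-overlap precision and sensitivity (assumed definitions).

    Assumes every read is assigned to a bin, so both measures share
    the same denominator (the total number of reads).
    """
    best_per_bin = {}      # largest single-species count inside each bin
    best_per_species = {}  # largest single-bin count for each species
    for (species, bin_id), n in contingency(truth, pred).items():
        best_per_bin[bin_id] = max(best_per_bin.get(bin_id, 0), n)
        best_per_species[species] = max(best_per_species.get(species, 0), n)
    total = len(truth)
    precision = sum(best_per_bin.values()) / total
    sensitivity = sum(best_per_species.values()) / total
    return precision, sensitivity


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index computed from pair counts."""
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two reads: the two partitions trivially agree
    sum_cells = sum(comb(n, 2) for n in contingency(truth, pred).values())
    sum_rows = sum(comb(n, 2) for n in Counter(truth).values())
    sum_cols = sum(comb(n, 2) for n in Counter(pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single group
    return (sum_cells - expected) / (max_index - expected)


if __name__ == "__main__":
    # Toy example: six reads from two species, split into two bins.
    truth = ["A", "A", "A", "B", "B", "B"]
    pred = [0, 0, 1, 1, 1, 1]
    print(precision_sensitivity(truth, pred))
    print(adjusted_rand_index(truth, pred))
```

Unlike precision and sensitivity, the adjusted Rand index does not require matching bins to species, and it is corrected for chance agreement, so it stays comparable when the number of discovered bins differs from the true number of species.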