The Importance of GC-content

This note demonstrates that GC-content is an important feature in genomic clustering, especially for MarkovBin.

Dataset Description and Some Observations

The dataset contains 5 different species. Fig. \ref{fig:1} shows the GC-content distribution of each species, where each point corresponds to one read. The x-axis gives the GC-content of the read and the y-axis gives its new group likelihood, plotted as a logarithm. Species are distinguished by color.
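For reference, the GC-content of a read is the fraction of its bases that are G or C. The following is a minimal sketch of that computation, not the code used to produce the figure. The function name \texttt{gc\_content} and the choice to ignore ambiguous bases such as N are assumptions made here for illustration.

```python
def gc_content(read: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a read."""
    seq = read.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Assumption: a read with no unambiguous bases is assigned GC-content 0.0.
    return gc / total if total else 0.0


print(gc_content("ATGC"))  # 0.5
```

Ambiguous bases are left out of the denominator so that a read with many N characters is not pushed artificially toward a low GC-content.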

The first observation is that the new group likelihood is large when GC-content is extremely high or extremely low, and reaches its minimum when GC-content is around 0.5. This is unsurprising, but it matters for clustering.

The second observation is that species 2 occupies the GC-content range from 0.15 to 0.35.

The third observation is that species 1 and species 3 share the GC-content range from 0.4 to 0.55, which makes them very hard to split into two groups.

The last observation is that species 4 and species 5 both fall in the GC-content range from 0.55 to 0.75. They appear to be mixed and may not be easy to separate.
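The three range observations can be summarized as a lookup from a GC-content value to the species whose reads were observed in that range. The sketch below is only an illustration of those observations and is not part of MarkovBin. The names \texttt{GC\_RANGES} and \texttt{candidate\_species} are hypothetical, and the range boundaries are the approximate values read off the figure.

```python
# Approximate GC-content ranges taken from the observations above.
# Each entry is (low, high, species numbers observed in that range).
GC_RANGES = [
    (0.15, 0.35, [2]),
    (0.40, 0.55, [1, 3]),
    (0.55, 0.75, [4, 5]),
]


def candidate_species(gc: float) -> list[int]:
    """Return the species whose observed GC-content range contains gc."""
    candidates = []
    for low, high, species in GC_RANGES:
        if low <= gc <= high:
            candidates.extend(species)
    return candidates


print(candidate_species(0.25))  # [2]
print(candidate_species(0.50))  # [1, 3]
```

GC-content alone isolates species 2, but for the other two ranges it only narrows a read down to a pair of species, so a further signal is needed to separate species within a pair.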

\label{fig:1}GC-content versus the logarithm of the new group likelihood