The Importance of GC-content

This note aims to show that GC-content plays an important role in genomic clustering, and in MarkovBin in particular.

Dataset Description and Some Observations

The dataset contains 5 species. Fig. \ref{fig:1} shows the GC-content distribution of each species, with one point per read. The x-axis gives the GC-content of the read and the y-axis gives its new group likelihood. Species are distinguished by color.
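For reference, the GC-content of a read is the fraction of its bases that are G or C. The following is a minimal sketch, assuming reads are plain strings and that ambiguous bases such as N are ignored; the function name `gc_content` is illustrative and is not taken from MarkovBin.

```python
def gc_content(read: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) in `read` that are G or C.

    Assumption: characters other than A, C, G, T (for example N) are skipped.
    """
    seq = read.upper()
    gc = sum(seq.count(base) for base in "GC")
    total = gc + sum(seq.count(base) for base in "AT")
    if total == 0:
        raise ValueError("read contains no A/C/G/T bases")
    return gc / total


if __name__ == "__main__":
    # Toy read, not taken from the dataset described in this note.
    print(gc_content("ATGCGC"))
```

Each point in the figure corresponds to one such value on the x-axis.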

The first observation is that the new group likelihood is large when GC-content is extremely high or low, and reaches its minimum when GC-content is around 0.5. This is unsurprising, but it matters for clustering.

The second observation is that species 2 occupies the GC-content range from 0.15 to 0.35.

The third observation is that species 1 and species 3 share the GC-content range from 0.4 to 0.55, which makes them very hard to separate from each other.

The last observation is that species 4 and species 5 share the GC-content range from 0.55 to 0.75. They appear to be mixed and may not be easy to separate.

\label{fig:1}GC-content versus the logarithm of the new group likelihood

The Approach of MarkovBin

MarkovBin first splits the whole GC-content range into subregions of equal width and then refines the grouping with the EM algorithm. The EM step is well established and we cannot hope to improve on it; the initial split, however, is risky. In our dataset, GC-content ranges from 0.15 to 0.75. There are 5 species, so the 5 initial subregions produced by MarkovBin each have width $(0.75 - 0.15)/5 = 0.12$: [0.15, 0.27], [0.27, 0.39], [0.39, 0.51], [0.51, 0.63], [0.63, 0.75]. Comparing these with the observations in the previous section reveals a problem: species 2, which spans 0.15 to 0.35, is split across the first two subregions. Unfortunately, the signals from these two subregions are similar and strong, so we expect species 2 to fill two subgroups, which is a serious error.
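The equal-width split described above can be reproduced with a short sketch. This only illustrates the arithmetic of the initial subregions and is not MarkovBin's code. The helper names `equal_width_bins` and `bins_overlapping` are hypothetical, and the species ranges are the approximate ones read off the figure in the previous section.

```python
def equal_width_bins(lo, hi, k):
    """Split [lo, hi] into k subregions of equal width."""
    width = (hi - lo) / k
    # Round the edges to suppress floating-point noise such as 0.27000000000000002.
    edges = [round(lo + i * width, 10) for i in range(k + 1)]
    return list(zip(edges[:-1], edges[1:]))


def bins_overlapping(bins, span):
    """Indices of the bins whose interior overlaps the interval `span`."""
    s_lo, s_hi = span
    return [i for i, (b_lo, b_hi) in enumerate(bins) if s_lo < b_hi and s_hi > b_lo]


# GC-content range and number of species as stated in this note.
bins = equal_width_bins(0.15, 0.75, 5)

# Approximate GC-content ranges from the observations (assumed values).
species_ranges = {
    "species 2": (0.15, 0.35),
    "species 1 and 3": (0.40, 0.55),
    "species 4 and 5": (0.55, 0.75),
}

if __name__ == "__main__":
    for name, span in species_ranges.items():
        hit = [bins[i] for i in bins_overlapping(bins, span)]
        print(f"{name}: overlaps {len(hit)} subregion(s): {hit}")
```

Under these assumed ranges, species 2 overlaps both of the first two subregions, which is the situation the paragraph above warns about.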

The results in Table \ref{tab:1} confirm this expectation.