The Importance of GC-content

This note demonstrates that GC-content matters a great deal in genomic clustering, and especially for MarkovBin.

Dataset Description and Some Observations

The dataset contains reads from 5 different species. Fig. \ref{fig:1} shows the GC-content distribution of each species; each point is one read. The x-axis is the GC-content and the y-axis is the logarithm of the new group likelihood. Species are distinguished by color.
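The note does not spell out how the GC-content of a read is computed, so the following is only a minimal sketch of the usual definition: the fraction of G and C bases in the read. The function name `gc_content` and the choice to ignore ambiguous bases such as N are assumptions made for illustration, not something taken from MarkovBin.

```python
def gc_content(read: str) -> float:
    """Return the fraction of G/C bases among the A/C/G/T bases of a read.

    Assumption: ambiguous bases (e.g. N) are ignored rather than counted
    in the denominator. The source note does not specify this.
    """
    seq = read.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("read contains no A/C/G/T bases")
    return gc / total
```

A read such as `"ATGC"` has two G/C bases out of four, so it would sit at 0.5 on the x-axis of the figure.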

The first observation is that the new group likelihood is large when GC-content is extremely high or low, and reaches its minimum when GC-content is around 0.5. This is unsurprising, but it matters for clustering.

The second observation is that species 2 occupies the GC-content range from 0.15 to 0.35.

The third observation is that species 1 and species 3 share the GC-content range from 0.4 to 0.55. In practice it is very hard to separate the two.

The last observation is that species 4 and species 5 share the GC-content range from 0.55 to 0.75. They appear mixed and may not be easy to separate.

\label{fig:1}GC-content and logarithm of the new group likelihood

The Approach of MarkovBin

MarkovBin first splits the whole GC-content range into subregions of equal width and then optimizes the grouping with the EM algorithm. The EM step works well and we cannot improve on it, but the initial split is risky. In our dataset the GC-content ranges from 0.15 to 0.75 and there are 5 species, so each subregion has width $(0.75 - 0.15)/5 = 0.12$ and the 5 initial subregions given by MarkovBin are [0.15, 0.27], [0.27, 0.39], [0.39, 0.51], [0.51, 0.63], [0.63, 0.75]. Comparing these with the observations in the previous section shows the problem: MarkovBin splits species 2, which spans 0.15 to 0.35, across the first two subregions. Unfortunately, the signals from these two subregions are similar and strong, so we can expect species 2 to fill 2 groups, which is a serious mistake.
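The equal-width split described above can be reproduced with a short sketch. This is not MarkovBin's code; it only mirrors the initialization as this note describes it. The helper names `equal_width_bins` and `bins_overlapping` are hypothetical, and the species ranges are the approximate values read off Fig. \ref{fig:1} in the previous section.

```python
def equal_width_bins(lo, hi, k):
    """Split [lo, hi] into k subregions of equal width."""
    width = (hi - lo) / k
    # Round the edges to suppress floating-point noise such as 0.27000000000000002.
    return [(round(lo + i * width, 10), round(lo + (i + 1) * width, 10))
            for i in range(k)]


def bins_overlapping(bins, lo, hi):
    """Return the 1-based indices of the bins that overlap the open interval (lo, hi)."""
    return [i + 1 for i, (a, b) in enumerate(bins) if a < hi and b > lo]


bins = equal_width_bins(0.15, 0.75, 5)

# Approximate GC-content ranges taken from the observations in this note.
species_ranges = {
    "species 2": (0.15, 0.35),
    "species 1 and 3": (0.40, 0.55),
    "species 4 and 5": (0.55, 0.75),
}

overlaps = {name: bins_overlapping(bins, lo, hi)
            for name, (lo, hi) in species_ranges.items()}

for name, idx in overlaps.items():
    print(name, "-> initial subregions", idx)
```

Under these assumed ranges, species 2 alone covers both of the first two subregions, which is the situation the paragraph above warns about.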

The result in Table \ref{tab:1} confirms this expectation.

\label{tab:1}Result of MarkovBin on dataset ***
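\begin{tabular}{lrrrrr}
 & Real Sp. 1 & Real Sp. 2 & Real Sp. 3 & Real Sp. 4 & Real Sp. 5 \\
Result Group 1 & 0 & 606 & 0 & 0 & 0 \\
Result Group 2 & 61 & 390 & 8 & 0 & 0 \\
Result Group 3 & 917 & 4 & 986 & 60 & 43 \\
Result Group 4 & 20 & 0 & 4 & 899 & 77 \\
Result Group 5 & 2 & 0 & 2 & 41 & 880 \\
\end{tabular}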
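The counts in Table \ref{tab:1} can be summarized per species to make the failure explicit. The sketch below copies the counts from the table and computes how the reads of one species are distributed over the result groups. The helper names `species_shares` and `group_shares` are hypothetical and are not part of MarkovBin.

```python
# Rows: result groups 1..5; columns: real species 1..5 (counts copied from the table).
table = [
    [0, 606, 0, 0, 0],
    [61, 390, 8, 0, 0],
    [917, 4, 986, 60, 43],
    [20, 0, 4, 899, 77],
    [2, 0, 2, 41, 880],
]


def species_shares(table, species):
    """Fraction of one species' reads assigned to each result group (1-based species index)."""
    column = [row[species - 1] for row in table]
    total = sum(column)
    return [count / total for count in column]


def group_shares(table, group):
    """Fraction of one result group's reads that come from each species (1-based group index)."""
    row = table[group - 1]
    total = sum(row)
    return [count / total for count in row]


sp2 = species_shares(table, 2)   # species 2 is split over groups 1 and 2
grp3 = group_shares(table, 3)    # group 3 mixes species 1 and species 3

print("species 2 over groups:", [round(x, 3) for x in sp2])
print("group 3 by species:   ", [round(x, 3) for x in grp3])
```

Each species contributes 1000 reads in this table. Species 2 is split 606 to 390 between groups 1 and 2, while group 3 absorbs 917 reads of species 1 and 986 reads of species 3.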

Reconstruct Dataset

Although our algorithm does better than MarkovBin on this dataset, it is still not good enough: we cannot separate sp.1 and sp.3 because they are very similar. Replacing sp.1 and/or sp.3 with other species may give a better result.

The GIs in the original dataset are (in order): 158333233, 57238731, 18311643, 148545259, 83591340. We replace GI 18311643 with GI 119718918, so the GIs in the new dataset are (in order): 158333233, 57238731, 148545259, 83591340, 119718918. The relationship between GC-content and the logarithm of the new group likelihood is shown in Fig. \ref{fig:2}.

The result of MarkovBin is shown in Table \ref{tab:2} and the result of DirichletCluster in Table \ref{tab:3}.