Cato edited Gibbs.tex over 10 years ago

Commit id: 903b1d36ca639379f1e4eb4de90b5ecbd91180ba

H = -\sum_i p_i \log_2 p_i.
\end{equation}
However, there may be larger-scale patterns in the string. Then we have to make more effort (WHY?). For example, \texttt{0101010101\ldots} has the same entropy as \texttt{010011000111\ldots}, but clearly there's something different going on! I see two cases, which have to be treated differently:
\begin{enumerate}
\item The bits come in chunks. Then we have to ``renormalise'' by using a different basis, e.g.\ DNA code should be expressed in base 4. For an alphabet of $m$ letters (binary: $m=2$; DNA: $m=4$) the corresponding informational entropy is
\begin{equation}
H = -\sum_i p_i \log_m p_i.
\end{equation}
We can be more sophisticated than this with a dictionary-based algorithm, e.g.\ Lempel-Ziv comma-counting, which captures repetitive structures (but perhaps not the right ones!). The entropy can become very low as the chunks get larger (do we talk about a book in the set of books, or the information content of a book?). Sketches of both ideas follow this list.
\item If the patterns that arise don't correspond to larger units of information that can be re-labelled (as in point 1), perhaps we have to experiment with conditional probabilities, i.e.\ assess the probability of strings with {\it memory} of whole sections of the string.\\
The simplest way to do this seems to be the so-called ``Markov model'' (see section~\ref{sec:markov} below). But what we really need is to construct the {\it correlation function} of the data; a sketch of that also follows the list.
\end{enumerate}
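
As a sanity check on point 1, here is a minimal Python sketch of the re-labelling idea (the function name \texttt{block\_entropy} and the example strings are mine, nothing standard): the per-symbol entropy in base $m$, before and after grouping the bits into pairs.
\begin{verbatim}
from collections import Counter
from math import log

def block_entropy(symbols, m):
    # Base-m Shannon entropy per symbol: H = -sum_i p_i log_m p_i.
    # 'symbols' is any sequence; m is the alphabet size (2 bits, 4 DNA).
    n = len(symbols)
    counts = Counter(symbols)
    return sum(-(c / n) * log(c / n, m) for c in counts.values())

bits = "01" * 16
# Re-labelling: read the bits two at a time, so each pair becomes one
# symbol of a 4-letter alphabet.
pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]

print(block_entropy(bits, 2))   # 1.0 -- the single-bit frequencies look random
print(block_entropy(pairs, 4))  # 0.0 -- only the pair "01" ever occurs
\end{verbatim}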
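
The comma-counting idea, sketched the simplest way I can see: an LZ78-style parse into the shortest previously-unseen phrases (I am assuming this is the right reading of ``comma-counting''; whether it is the right dictionary is exactly the worry above).
\begin{verbatim}
def lz_phrase_count(s):
    # Parse s left to right into the shortest phrases not seen before
    # (LZ78-style). Repetitive strings need few phrases; irregular
    # strings of the same length need more.
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    return len(phrases) + (1 if current else 0)

periodic = "01" * 32
# Binary expansions of 1, 2, 3, ... concatenated: same length (64),
# but much less regular.
irregular = "".join(format(k, "b") for k in range(1, 19))

print(lz_phrase_count(periodic), lz_phrase_count(irregular))
\end{verbatim}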
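
And for the correlation function of the data, the obvious estimator is the autocorrelation at each lag $t$, i.e.\ $C(t) = \langle s_i s_{i+t}\rangle - \langle s_i\rangle\langle s_{i+t}\rangle$. A sketch (naming is mine):
\begin{verbatim}
def autocorrelation(bits, max_lag):
    # C(t) = <s_i s_{i+t}> - <s_i><s_{i+t}> for a 0/1 string,
    # estimated over all pairs a distance t apart.
    s = [int(b) for b in bits]
    out = []
    for t in range(1, max_lag + 1):
        x, y = s[:len(s) - t], s[t:]
        mean_x = sum(x) / len(x)
        mean_y = sum(y) / len(y)
        c = sum(a * b for a, b in zip(x, y)) / len(x) - mean_x * mean_y
        out.append(c)
    return out

# The alternating string reveals its period-2 structure immediately:
# C(t) oscillates in sign with period 2.
print(autocorrelation("01" * 16, 4))
\end{verbatim}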

\subsubsection{Markov model}
\label{sec:markov}

This is actually much less useful than I originally thought, but here it is anyway.
\begin{itemize}
\item A zeroth-order Markov model assumes each character is independent of the one before, so the entropy is given by equation~\ref{eqn:indep_entropy}.
\item In a first-order Markov model, the probability of each character ($j$) depends on the character immediately preceding it ($i$). Then,
\begin{equation}
H = -\sum_i p(i) \sum_j p(j|i) \log_2 p(j|i).
\end{equation}
(A sketch of this estimate follows the list.)
\item And so on for higher orders.
\end{itemize}
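
A sketch of the first-order estimate, using the observed pair frequencies as plug-in probabilities (my naming; no correction for finite-sample bias):
\begin{verbatim}
from collections import Counter
from math import log2

def markov1_entropy(s):
    # H = -sum_i p(i) sum_j p(j|i) log2 p(j|i), estimated from the
    # observed (i, j) pair counts; equivalently -sum_ij p(i,j) log2 p(j|i).
    pair_counts = Counter(zip(s, s[1:]))
    first_counts = Counter(s[:-1])
    n_pairs = len(s) - 1
    h = 0.0
    for (i, j), c in pair_counts.items():
        p_ij = c / n_pairs              # joint probability p(i, j)
        p_j_given_i = c / first_counts[i]
        h -= p_ij * log2(p_j_given_i)
    return h

print(markov1_entropy("01" * 32))           # 0.0: each bit fixes its successor
print(markov1_entropy("0100110001110101"))  # > 0: successors vary
\end{verbatim}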

%%--------------------------------------------------------------------------------------
\subsection{Aside: notes on DNA}