Cato edited Gibbs.tex over 10 years ago

Commit id: 903b1d36ca639379f1e4eb4de90b5ecbd91180ba

H = -\sum_i p_i \log_2 p_i.
\end{equation}
However, there may be larger-scale patterns in the string. Then we have to make more effort (WHY?). For example, \texttt{0101010101\ldots} has the same entropy as \texttt{010011000111\ldots}, but clearly there's something different going on! I see two cases, which have to be treated differently:
\begin{enumerate}
\item The bits come in chunks. Then we have to ``renormalise'' by using a different basis, e.g.\ DNA code should be expressed in base 4. For an alphabet of $m$ letters (binary: $m=2$; DNA: $m=4$) the corresponding informational entropy is
\begin{equation}
H = -\sum_i p_i \log_m p_i.
\end{equation}
We can be more sophisticated than this with a dictionary-based algorithm, e.g.\ Lempel-Ziv comma-counting, which captures repetitive structures (but perhaps not the right ones!). The entropy can become very low as the chunks get larger (do we talk about a book in the set of books, or the information content of a book?). Sketches of both ideas follow this list.
\item If the patterns that arise don't correspond to larger units of information that can be re-labelled (as in point 1), perhaps we have to experiment with conditional probabilities, i.e.\ assess the probability of strings with {\it memory} of whole sections of the string.\\
The simplest way to do this seems to be the so-called ``Markov model'' (see section~\ref{sec:markov} below). But what we really need is to construct the {\it correlation function} of the data; a sketch of that also follows the list.
\end{enumerate}
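
As a sanity check on point 1, here is a minimal Python sketch of the re-labelling idea (the function name \texttt{block\_entropy} and the example strings are mine, nothing standard): the per-symbol entropy in base $m$, before and after grouping the bits into pairs.
\begin{verbatim}
from collections import Counter
from math import log

def block_entropy(symbols, m):
    # Base-m Shannon entropy per symbol: H = -sum_i p_i log_m p_i.
    # 'symbols' is any sequence; m is the alphabet size (2 bits, 4 DNA).
    n = len(symbols)
    counts = Counter(symbols)
    return sum(-(c / n) * log(c / n, m) for c in counts.values())

bits = "01" * 16
# Re-labelling: read the bits two at a time, so each pair becomes one
# symbol of a 4-letter alphabet.
pairs = [bits[i:i + 2] for i in range(0, len(bits), 2)]

print(block_entropy(bits, 2))   # 1.0 -- the single-bit frequencies look random
print(block_entropy(pairs, 4))  # 0.0 -- only the pair "01" ever occurs
\end{verbatim}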
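
The comma-counting idea, sketched the simplest way I can see: an LZ78-style parse into the shortest previously-unseen phrases (I am assuming this is the right reading of ``comma-counting''; whether it is the right dictionary is exactly the worry above).
\begin{verbatim}
def lz_phrase_count(s):
    # Parse s left to right into the shortest phrases not seen before
    # (LZ78-style). Repetitive strings need few phrases; irregular
    # strings of the same length need more.
    phrases, current = set(), ""
    for ch in s:
        current += ch
        if current not in phrases:
            phrases.add(current)
            current = ""
    return len(phrases) + (1 if current else 0)

periodic = "01" * 32
# Binary expansions of 1, 2, 3, ... concatenated: same length (64),
# but much less regular.
irregular = "".join(format(k, "b") for k in range(1, 19))

print(lz_phrase_count(periodic), lz_phrase_count(irregular))
\end{verbatim}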
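
And for the correlation function of the data, the obvious estimator is the autocorrelation at each lag $t$, i.e.\ $C(t) = \langle s_i s_{i+t}\rangle - \langle s_i\rangle\langle s_{i+t}\rangle$. A sketch (naming is mine):
\begin{verbatim}
def autocorrelation(bits, max_lag):
    # C(t) = <s_i s_{i+t}> - <s_i><s_{i+t}> for a 0/1 string,
    # estimated over all pairs a distance t apart.
    s = [int(b) for b in bits]
    out = []
    for t in range(1, max_lag + 1):
        x, y = s[:len(s) - t], s[t:]
        mean_x = sum(x) / len(x)
        mean_y = sum(y) / len(y)
        c = sum(a * b for a, b in zip(x, y)) / len(x) - mean_x * mean_y
        out.append(c)
    return out

# The alternating string reveals its period-2 structure immediately:
# C(t) oscillates in sign with period 2.
print(autocorrelation("01" * 16, 4))
\end{verbatim}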

\subsubsection{Markov model}
\label{sec:markov}

This is actually much less useful than I originally thought, but here it is anyway.
\begin{itemize}
\item A zeroth-order Markov model assumes each character is independent of the one before, so the entropy is given by equation~\ref{eqn:indep_entropy}.
\item In a first-order Markov model, the probability of each character ($j$) depends on the character immediately preceding it ($i$). Then,
\begin{equation}
H = -\sum_i p(i) \sum_j p(j|i) \log_2 p(j|i).
\end{equation}
(A sketch of this estimate follows the list.)
\item And so on for higher orders.
\end{itemize}
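
A sketch of the first-order estimate, using the observed pair frequencies as plug-in probabilities (my naming; no correction for finite-sample bias):
\begin{verbatim}
from collections import Counter
from math import log2

def markov1_entropy(s):
    # H = -sum_i p(i) sum_j p(j|i) log2 p(j|i), estimated from the
    # observed (i, j) pair counts; equivalently -sum_ij p(i,j) log2 p(j|i).
    pair_counts = Counter(zip(s, s[1:]))
    first_counts = Counter(s[:-1])
    n_pairs = len(s) - 1
    h = 0.0
    for (i, j), c in pair_counts.items():
        p_ij = c / n_pairs              # joint probability p(i, j)
        p_j_given_i = c / first_counts[i]
        h -= p_ij * log2(p_j_given_i)
    return h

print(markov1_entropy("01" * 32))           # 0.0: each bit fixes its successor
print(markov1_entropy("0100110001110101"))  # > 0: successors vary
\end{verbatim}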

%%--------------------------------------------------------------------------------------
\subsection{Aside: notes on DNA}