Entropy of a bit-string

\label{sec:stringS}

Why is this interesting? Shannon entropy quantifies the average “surprise” of some information; if we want to compare two information sets, we may want to build up an intuition of the properties of one set and see how easy it is to predict the properties of the other.

Note that people are often a little loose with vocabulary – entropy is a property of distributions, not (technically) of individual sequences. We observe the sequence as a variate from some parent distribution, and estimate the entropy of the parent. For this approach to succeed in complicated cases, we will need a very long sequence (see e.g. Farach et al. 1994).

When bits are drawn randomly and independently, the entropy of the source (in bits per symbol) is given by \[H = -\sum_i p_i \log_2 p_i, \label{eqn:indep_entropy}\] where \(p_i\) is the probability of symbol \(i\).
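
To make this concrete, here is a minimal sketch (my own, not part of the original notes; the function name iid_entropy is arbitrary) that estimates the \(p_i\) from the observed symbol frequencies and evaluates equation \ref{eqn:indep_entropy}:

\begin{verbatim}
# Empirical Shannon entropy (bits per symbol), treating each symbol
# as an independent draw from the same distribution.
from collections import Counter
from math import log2

def iid_entropy(s):
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in counts.values())

print(iid_entropy("0101010101"))    # 1.0
print(iid_entropy("010011000111"))  # 1.0 -- same value, different structure
\end{verbatim}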

However, there may be larger-scale patterns in the string. Then we have to make more effort, because equation \ref{eqn:indep_entropy} only sees the symbol frequencies and ignores the order in which the bits appear. For example, 0101010101 has the same single-symbol entropy as 010011000111, but clearly there’s something different going on! I see two cases, which have to be treated differently:

  1. The bits come in chunks. Then we have to “renormalise” by using a different base. E.g. DNA code should be expressed in base 4. For an alphabet of \(m\) letters (binary: \(m=2\); DNA: \(m=4\)) the corresponding informational entropy (in base-\(m\) digits per symbol) is \[H = -\sum_i p_i \log_m p_i.\] We can be more sophisticated than this with a dictionary-building algorithm, e.g. Lempel-Ziv comma-counting, which captures repetitive structures (but perhaps not the right ones!). Entropy can become very low as the chunks get larger (do we talk about a book in the set of books, or the information content of a book?). A sketch of the chunking calculation is given after this list.

  2. If the patterns that arise don’t correspond to larger units of information that can be re-labelled (as in point 1), perhaps we have to experiment with conditional probabilities, i.e. assess the probability of each symbol with memory of whole sections of the string.
    The simplest way to do this seems to be the so-called “Markov model” (see section \ref{sec:markov} below). But what we really need is to construct the correlation function of the data; a sketch of that calculation is also given after this list.
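
As a concrete illustration of point 1 (a minimal sketch of mine, not part of the original notes; the function name chunk_entropy and the use of non-overlapping chunks are my own choices), we can regroup the bit-string into \(k\)-bit chunks and evaluate the base-\(m\) formula with \(m = 2^k\):

\begin{verbatim}
# Regroup a bit-string into non-overlapping k-bit chunks and compute
# the entropy in base m = 2**k, i.e. H = -sum_i p_i log_m p_i.
from collections import Counter
from math import log

def chunk_entropy(bits, k):
    m = 2 ** k
    chunks = [bits[i:i + k] for i in range(0, len(bits) - k + 1, k)]
    counts = Counter(chunks)
    n = len(chunks)
    return -sum((c / n) * log(c / n, m) for c in counts.values())

print(chunk_entropy("0101010101", 2))    # only the chunk "01" occurs -> 0.0
print(chunk_entropy("010011000111", 2))  # three distinct chunks -> about 0.79
\end{verbatim}

With \(k=2\) the two example strings from above are now distinguished: the strictly alternating one collapses to a single repeated chunk, while the other retains most of its apparent entropy.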
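
For point 2, a simple first look at the memory in the string is its autocorrelation function. The sketch below (again mine, with the estimator and the function name autocorrelation chosen for illustration) computes the normalised correlation at a range of lags:

\begin{verbatim}
# Normalised autocorrelation C(tau) of the bit sequence:
# C(tau) = <(x_t - mean)(x_{t+tau} - mean)> / variance.
def autocorrelation(bits, max_lag):
    x = [int(b) for b in bits]
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    corr = []
    for tau in range(1, max_lag + 1):
        cov = sum((x[t] - mean) * (x[t + tau] - mean)
                  for t in range(n - tau)) / (n - tau)
        corr.append(cov / var)
    return corr

print(autocorrelation("0101010101", 3))    # alternating string: [-1.0, 1.0, -1.0]
print(autocorrelation("010011000111", 3))  # weaker, less regular correlations
\end{verbatim}

The strictly alternating string shows perfect (anti-)correlation at every lag, which is exactly the structure that equation \ref{eqn:indep_entropy} cannot see.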

Markov model

\label{sec:markov}

This is actually much less useful than I originally thought, but here it is anyway.

  • A zeroth-order Markov model assumes each character is independent of everything that precedes it, so the entropy is given by equation \ref{eqn:indep_entropy}.

  • In a first-order Markov model, the probability of each character (\(j\)) depends on the character immediately preceding it (\(i\)). Then, \[H = -\sum_i p(i) \sum_j p(j|i) \log_2 p(j|i).\] A numerical sketch of this case follows the list.

  • And so on for higher orders.
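
As a sketch of the first-order case (mine, not from the notes; the pair-counting estimator and the name markov1_entropy are assumptions), we can estimate \(p(i)\) and \(p(j|i)\) from adjacent-pair counts and evaluate the sum above:

\begin{verbatim}
# First-order Markov entropy rate:
# H = -sum_i p(i) sum_j p(j|i) log2 p(j|i),
# with p(i) and p(j|i) estimated from counts of adjacent pairs.
from collections import Counter
from math import log2

def markov1_entropy(s):
    pairs = Counter(zip(s, s[1:]))   # counts of adjacent pairs (i, j)
    firsts = Counter(s[:-1])         # counts of i appearing as a predecessor
    n_pairs = sum(pairs.values())
    h = 0.0
    for (i, j), c in pairs.items():
        p_ij = c / n_pairs           # joint probability p(i, j) = p(i) p(j|i)
        p_j_given_i = c / firsts[i]  # conditional probability p(j|i)
        h -= p_ij * log2(p_j_given_i)
    return h

print(markov1_entropy("0101010101"))    # 0.0: each bit is fixed by the previous one
print(markov1_entropy("010011000111"))  # about 0.99: almost no predictive memory
\end{verbatim}

The alternating string, which looked maximally entropic under equation \ref{eqn:indep_entropy}, is revealed as perfectly predictable once one symbol of memory is allowed.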