
Fundamentals

Optimal Predictors

We start with a simple model to help build intuition and motivate the following definitions.

Consider finite sets \(X\) and \(Y\), a probability distribution \(\mu: X \rightarrow [0,1]\), a mapping \(m: X \rightarrow Y\), and a function \(f: X \rightarrow {\mathbb{R}}\). Suppose \(x\) was sampled from \(\mu\) and we were told \(y := m(x)\) (but not told \(x\) itself). Under these conditions, our expected value of \(f(x)\) is \(\operatorname{E}[f(x)] = \operatorname{E}_{x' \sim \mu}[f(x') \mid m(x') = y]\).
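Spelled out as a finite sum, this conditional expected value is just the \(\mu\)-weighted average of \(f\) over the fiber \(m^{-1}(y)\) (well-defined whenever the fiber has positive probability, which holds almost surely here since \(y = m(x)\) for a sample \(x\) from \(\mu\)):

\[\operatorname{E}_{x' \sim \mu}[f(x') \mid m(x') = y] = \frac{\sum_{x' \in m^{-1}(y)} \mu(x') f(x')}{\sum_{x' \in m^{-1}(y)} \mu(x')}.\]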

Let \(P: X \rightarrow {\mathbb{R}}\) be the function \(P(x) := \operatorname{E}_{x' \sim \mu}[f(x') \mid m(x') = m(x)]\). How can we characterize \(P\) without referring to the concept of a conditional expected value? For any \(Q: X \rightarrow {\mathbb{R}}\) we can consider the “error” \(\operatorname{E}_\mu[(Q - f)^2]\). \(Q\) is called “efficient” when it factors as \(Q = q \circ m\) for some \(q: Y \rightarrow {\mathbb{R}}\). It is easy to see that \(P\) has the least error among all efficient functions; a short derivation and a numerical check follow.
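To see why, note first that \(P\) is itself efficient, since \(P = p \circ m\) for \(p(y) := \operatorname{E}_{x' \sim \mu}[f(x') \mid m(x') = y]\). For any efficient \(Q\), expand the error around \(P\):

\[\operatorname{E}_\mu[(Q - f)^2] = \operatorname{E}_\mu[(Q - P)^2] + 2\operatorname{E}_\mu[(Q - P)(P - f)] + \operatorname{E}_\mu[(P - f)^2].\]

The cross term vanishes: \(Q - P\) is efficient and hence constant on every fiber \(m^{-1}(y)\), while by the definition of \(P\) the \(\mu\)-weighted average of \(P - f\) over each fiber is zero, so summing fiber by fiber gives \(\operatorname{E}_\mu[(Q - P)(P - f)] = 0\). Therefore

\[\operatorname{E}_\mu[(Q - f)^2] = \operatorname{E}_\mu[(Q - P)^2] + \operatorname{E}_\mu[(P - f)^2] \geq \operatorname{E}_\mu[(P - f)^2],\]

with equality exactly when \(Q = P\) on the support of \(\mu\).

The same fact can be checked numerically. Below is a minimal sketch in Python; the particular \(X\), \(Y\), \(\mu\), \(m\), and \(f\) are arbitrary toy choices made up for illustration, not taken from the text.

```python
import random
from collections import defaultdict

# Minimal numerical sketch (toy instance invented for illustration):
# check that P = E[f | m] has the least error E_mu[(Q - f)^2] among
# "efficient" functions, i.e. those of the form Q = q ∘ m.

X = range(8)
Y = range(3)
random.seed(0)

weights = [random.random() for _ in X]
mu = [w / sum(weights) for w in weights]   # a probability distribution on X
m = {x: x % 3 for x in X}                  # an arbitrary mapping m: X -> Y
f = {x: float(x * x) for x in X}           # an arbitrary function f: X -> R

# P(x) = E_{x'~mu}[f(x') | m(x') = m(x)]: mu-weighted average over each fiber.
num, den = defaultdict(float), defaultdict(float)
for x in X:
    num[m[x]] += mu[x] * f[x]
    den[m[x]] += mu[x]
P = {x: num[m[x]] / den[m[x]] for x in X}

def error(Q):
    """The squared error E_mu[(Q - f)^2]."""
    return sum(mu[x] * (Q[x] - f[x]) ** 2 for x in X)

# P never loses to a randomly sampled efficient competitor Q = q ∘ m.
for _ in range(1000):
    q = {y: random.uniform(-10.0, 100.0) for y in Y}
    Q = {x: q[m[x]] for x in X}
    assert error(P) <= error(Q) + 1e-12
print("error of P:", error(P))
```

Grouping by \(m(x)\) and taking the \(\mu\)-weighted average per fiber is exactly the explicit sum formula given earlier.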