How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?
Formally: how do we know the hypothesis \(h\) is close enough to the target function \(f\) if we don’t know what \(f\) is?
Questions
How many examples do we need for a “good” \(h\)?
What hypothesis space should we use?
If the hypothesis space is very complex, can we even find the best \(h\), or do we have to settle for a local maximum in the space of hypotheses?
How complex should \(h\) be?
How do we avoid overfitting?
How many examples are needed for learning?
Computational Learning Theory: Any hypothesis that is seriously wrong will almost certainly be “found out” after a small number of examples, because it will make an incorrect prediction.
Probably Approximately Correct: Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong.
PAC Learning
Any learning algorithm that returns hypotheses that are probably approximately correct; used to provide bounds on the performance of various learning algorithms.
Stationary Assumption
Future examples are going to be drawn from the same fixed distribution \(P(E) = P(X,Y)\) as past examples
Assume that the true function \(f\) is deterministic and is a member of the hypothesis class \(H\) that is being considered.
Boolean Function PACs
Error rate of hypothesis \(h\) is the expected generalization error for examples drawn from the stationary distribution
error(\(h\))\(=GenLoss_{L_{0/1}}(h) = \sum\limits_{x,y}L_{0/1}(y,h(x))\,P(x,y).\)
In other words, error(\(h\)) is the probability that \(h\) misclassifies a new example.
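Since error(\(h\)) is a probability over the stationary distribution, it can be estimated by sampling. A minimal sketch, assuming a hypothetical one-dimensional target \(f\) and hypothesis \(h\) (both thresholds chosen here purely for illustration):

```python
import random

def f(x):            # the (unknown) target function: x >= 0.5
    return x >= 0.5

def h(x):            # a learned hypothesis with a slightly wrong threshold
    return x >= 0.6

def estimate_error(h, f, n_samples=100_000, seed=0):
    """Monte Carlo estimate of error(h): draw examples from the stationary
    distribution (uniform on [0, 1) here) and average the 0/1 loss."""
    rng = random.Random(seed)
    mistakes = sum(h(x) != f(x) for x in (rng.random() for _ in range(n_samples)))
    return mistakes / n_samples

err = estimate_error(h, f)  # true error is 0.1: h misclassifies x in [0.5, 0.6)
```

With enough samples the estimate concentrates near the true misclassification probability of 0.1.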
A hypothesis \(h\) is approximately correct if error(\(h\)) \(\leq \epsilon\), where \(\epsilon\) is a small enough constant.
We can calculate the probability that a “seriously wrong” hypothesis \(h_b \in H_{bad}\) is consistent with the first \(N\) examples.
We know that error(\(h_b\)) \(>\) \(\epsilon\).
Thus the probability that it agrees with a given example is at most \(1-\epsilon\).
Since examples are drawn independently, \(P(h_b\) agrees with \(N\) examples) \(\leq (1-\epsilon)^N.\)
Probability that \(H_{bad}\) contains at least one consistent hypothesis is bounded by the sum of the individual probabilities:
\(P(H_{bad}\) contains a consistent hypothesis) \(\leq |H_{bad}|(1-\epsilon)^N \leq \, |H|(1-\epsilon)^N\)
If \(|H|(1-\epsilon)^N \leq \delta\), then with probability at least \(1-\delta\) any consistent hypothesis is approximately correct. Using \(1-\epsilon \leq e^{-\epsilon}\) and solving for \(N\) gives the sample complexity \(N \geq \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)\).
\(|H| = 2^{2^n}\) if \(H\) is the set of all Boolean functions on \(n\) attributes.
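The sample-complexity bound \(N \geq \frac{1}{\epsilon}(\ln\frac{1}{\delta} + \ln|H|)\) is easy to evaluate. A minimal sketch (the function name and parameters are illustrative; \(\ln|H|\) is passed directly because \(|H| = 2^{2^n}\) is astronomically large):

```python
import math

def pac_sample_bound(ln_H, epsilon, delta):
    """Number of examples N sufficient for PAC learning:
    N >= (1/epsilon) * (ln(1/delta) + ln|H|)."""
    return math.ceil((1 / epsilon) * (math.log(1 / delta) + ln_H))

# All Boolean functions on n attributes: |H| = 2^(2^n), so ln|H| = 2^n * ln 2.
n = 10
ln_H = (2 ** n) * math.log(2)
N = pac_sample_bound(ln_H, epsilon=0.1, delta=0.05)  # N = 7128
```

Note the bound is dominated by \(\ln|H| = 2^n \ln 2\): for the full Boolean hypothesis space the required number of examples grows exponentially in \(n\), which motivates restricting to learnable subsets such as decision lists below.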
Decision Lists
Idea: Focus on learnable subsets of the entire hypothesis space of Boolean functions.
A decision list consists of a series of tests, each of which is a conjunction of literals.
If a test succeeds when applied to an example description, the decision list specifies the value to be returned.
If the test fails, processing continues with the next test in the list.
Decision lists resemble decision trees, but they branch in only one direction.
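A decision list can be sketched as a sequence of (test, outcome) pairs, where each test is a conjunction of literals. The first test whose literals all match the example determines the returned value (the representation below is illustrative, not prescribed by the notes):

```python
def dl_classify(decision_list, example, default=False):
    """Return the outcome of the first test that succeeds on the example.
    A test is a tuple of (attribute, required_value) literals; it succeeds
    when every literal matches."""
    for test, outcome in decision_list:
        if all(example.get(attr) == val for attr, val in test):
            return outcome
    return default  # no test succeeded

# "If A and not B then True; else if B then False; else True."
dl = [((("A", True), ("B", False)), True),
      ((("B", True),), False),
      ((), True)]  # the empty test always succeeds, acting as a final default
```

Unlike a decision tree, a failed test never branches into a subtree; control simply falls through to the next test.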
\(k\)-DL: each test can have at most \(k\) literals.
Polynomial sample complexity: the number of examples needed to PAC-learn a \(k\)-DL grows only polynomially with the number of attributes \(n\).
Learning decision lists: Greedily select a test that matches a nonempty subset of the remaining examples all having the same classification, append (test, classification) to the list, discard the matched examples, and repeat until no examples remain.
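The greedy step above can be sketched as follows, a brute-force search over tests of at most \(k\) literals (function names and the example representation are assumptions for illustration; examples are (attribute-dict, label) pairs):

```python
from itertools import combinations, product

def find_test(examples, attributes, k):
    """Search for a test of at most k literals that matches a nonempty,
    homogeneously labeled subset of the examples."""
    for size in range(k + 1):
        for attrs in combinations(attributes, size):
            for vals in product([True, False], repeat=size):
                test = tuple(zip(attrs, vals))
                matched = [(x, y) for x, y in examples
                           if all(x.get(a) == v for a, v in test)]
                if matched and len({y for _, y in matched}) == 1:
                    return test, matched[0][1]
    return None

def learn_k_dl(examples, attributes, k):
    """Greedy k-DL learner: repeatedly peel off a homogeneous subset."""
    examples, dl = list(examples), []
    while examples:
        found = find_test(examples, attributes, k)
        if found is None:
            return None          # no consistent k-DL for these examples
        test, label = found
        dl.append((test, label))
        examples = [(x, y) for x, y in examples
                    if not all(x.get(a) == v for a, v in test)]
    return dl

# Target concept f(A, B) = A, learnable as a 1-DL.
examples = [({"A": True,  "B": True},  True),
            ({"A": True,  "B": False}, True),
            ({"A": False, "B": True},  False),
            ({"A": False, "B": False}, False)]
dl = learn_k_dl(examples, ["A", "B"], k=1)
```

Each iteration removes at least one example, so the loop terminates; the learned list here is "if A then True; else False".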