How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?
Formally: how do we know the hypothesis \(h\) is close enough to the target function \(f\) if we don’t know what \(f\) is?
Questions
How many examples do we need for a “good” \(h\)?
What hypothesis space should we use?
If the hypothesis space is very complex, can we even find the best \(h\), or do we have to settle for a local maximum in the space of hypotheses?
How complex should \(h\) be?
How do we avoid overfitting?
How many examples are needed for learning?
Computational Learning Theory: Any hypothesis that is seriously wrong will almost certainly be “found out” after a small number of examples, because it will make an incorrect prediction.
Probably Approximately Correct: Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong.
PAC Learning
Any learning algorithm that returns hypotheses that are probably approximately correct; used to provide bounds on the performance of various learning algorithms.
Stationary Assumption
Future examples are going to be drawn from the same fixed distribution \(P(E) = P(X,Y)\) as past examples
Assume that the true function \(f\) is deterministic and is a member of the hypothesis class \(H\) that is being considered.
Boolean Function PACs
Error rate of hypothesis \(h\) is the expected generalization error for examples drawn from the stationary distribution
error(\(h\))\(=GenLoss_{L_{0/1}}(h) = \sum\limits_{x,y}L_{0/1}(y,h(x))\,P(x,y).\)
In other words, error(\(h\)) is the probability that \(h\) misclassifies a new example.
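Since error(\(h\)) is a probability over the stationary distribution, it can be estimated by sampling. A minimal sketch, assuming a hypothetical one-dimensional target \(f\) and hypothesis \(h\) (both thresholds chosen here purely for illustration):

```python
import random

def f(x):            # the (unknown) target function: x >= 0.5
    return x >= 0.5

def h(x):            # a learned hypothesis with a slightly wrong threshold
    return x >= 0.6

def estimate_error(h, f, n_samples=100_000, seed=0):
    """Monte Carlo estimate of error(h): draw examples from the stationary
    distribution (uniform on [0, 1) here) and average the 0/1 loss."""
    rng = random.Random(seed)
    mistakes = sum(h(x) != f(x) for x in (rng.random() for _ in range(n_samples)))
    return mistakes / n_samples

err = estimate_error(h, f)  # true error is 0.1: h misclassifies x in [0.5, 0.6)
```

With enough samples the estimate concentrates near the true misclassification probability of 0.1.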
A hypothesis \(h\) is approximately correct if error(\(h\)) \(\leq \epsilon\), where \(\epsilon\) is a small enough constant.
We can calculate the probability that a “seriously wrong” hypothesis \(h_b \in H_{bad}\) is consistent with the first \(N\) examples.
We know that error(\(h_b\)) \(>\) \(\epsilon\).
Thus the probability that it agrees with a given example is at most \(1-\epsilon\).
Since examples are drawn independently, \(P(h_b\) agrees with \(N\) examples) \(\leq (1-\epsilon)^N.\)
Probability that \(H_{bad}\) contains at least one consistent hypothesis is bounded by the sum of the individual probabilities:
\(P(H_{bad}\) contains a consistent hypothesis) \(\leq |H_{bad}|(1-\epsilon)^N \leq \, |H|(1-\epsilon)^N\)
If \(|H|(1-\epsilon)^N \leq \delta\), then with probability at least \(1-\delta\) any consistent hypothesis is approximately correct. Using \(1-\epsilon \leq e^{-\epsilon}\) and solving for \(N\) gives the sample complexity \(N \geq \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)\).
\(|H| = 2^{2^n}\) if \(H\) is the set of all Boolean functions on \(n\) attributes.
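The sample-complexity bound \(N \geq \frac{1}{\epsilon}(\ln\frac{1}{\delta} + \ln|H|)\) is easy to evaluate. A minimal sketch (the function name and parameters are illustrative; \(\ln|H|\) is passed directly because \(|H| = 2^{2^n}\) is astronomically large):

```python
import math

def pac_sample_bound(ln_H, epsilon, delta):
    """Number of examples N sufficient for PAC learning:
    N >= (1/epsilon) * (ln(1/delta) + ln|H|)."""
    return math.ceil((1 / epsilon) * (math.log(1 / delta) + ln_H))

# All Boolean functions on n attributes: |H| = 2^(2^n), so ln|H| = 2^n * ln 2.
n = 10
ln_H = (2 ** n) * math.log(2)
N = pac_sample_bound(ln_H, epsilon=0.1, delta=0.05)  # N = 7128
```

Note the bound is dominated by \(\ln|H| = 2^n \ln 2\): for the full Boolean hypothesis space the required number of examples grows exponentially in \(n\), which motivates restricting to learnable subsets such as decision lists below.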
Decision Lists
Idea: Focus on learnable subsets of the entire hypothesis space of Boolean functions.
A decision list consists of a series of tests, each of which is a conjunction of literals.
If a test succeeds when applied to an example description, the decision list specifies the value to be returned.
If the test fails, processing continues with the next test in the list.
Decision lists resemble decision trees, but they branch in only one direction.
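A decision list can be sketched as a sequence of (test, outcome) pairs, where each test is a conjunction of literals. The first test whose literals all match the example determines the returned value (the representation below is illustrative, not prescribed by the notes):

```python
def dl_classify(decision_list, example, default=False):
    """Return the outcome of the first test that succeeds on the example.
    A test is a tuple of (attribute, required_value) literals; it succeeds
    when every literal matches."""
    for test, outcome in decision_list:
        if all(example.get(attr) == val for attr, val in test):
            return outcome
    return default  # no test succeeded

# "If A and not B then True; else if B then False; else True."
dl = [((("A", True), ("B", False)), True),
      ((("B", True),), False),
      ((), True)]  # the empty test always succeeds, acting as a final default
```

Unlike a decision tree, a failed test never branches into a subtree; control simply falls through to the next test.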
\(k\)-DL: each test can have at most \(k\) literals.
Polynomial sample complexity: the number of examples needed to PAC-learn a \(k\)-DL grows only polynomially with the number of attributes \(n\).
Learning decision lists: Greedily select a test that matches a nonempty subset of the remaining examples all having the same classification, append (test, classification) to the list, discard the matched examples, and repeat until no examples remain.
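The greedy step above can be sketched as follows, a brute-force search over tests of at most \(k\) literals (function names and the example representation are assumptions for illustration; examples are (attribute-dict, label) pairs):

```python
from itertools import combinations, product

def find_test(examples, attributes, k):
    """Search for a test of at most k literals that matches a nonempty,
    homogeneously labeled subset of the examples."""
    for size in range(k + 1):
        for attrs in combinations(attributes, size):
            for vals in product([True, False], repeat=size):
                test = tuple(zip(attrs, vals))
                matched = [(x, y) for x, y in examples
                           if all(x.get(a) == v for a, v in test)]
                if matched and len({y for _, y in matched}) == 1:
                    return test, matched[0][1]
    return None

def learn_k_dl(examples, attributes, k):
    """Greedy k-DL learner: repeatedly peel off a homogeneous subset."""
    examples, dl = list(examples), []
    while examples:
        found = find_test(examples, attributes, k)
        if found is None:
            return None          # no consistent k-DL for these examples
        test, label = found
        dl.append((test, label))
        examples = [(x, y) for x, y in examples
                    if not all(x.get(a) == v for a, v in test)]
    return dl

# Target concept f(A, B) = A, learnable as a 1-DL.
examples = [({"A": True,  "B": True},  True),
            ({"A": True,  "B": False}, True),
            ({"A": False, "B": True},  False),
            ({"A": False, "B": False}, False)]
dl = learn_k_dl(examples, ["A", "B"], k=1)
```

Each iteration removes at least one example, so the loop terminates; the learned list here is "if A then True; else False".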