Theory of Learning (18.5, p. 713)

  • How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?
    Formally: how do we know the hypothesis \(h\) is close enough to the target function \(f\) if we don’t know what \(f\) is?

  • Questions

    • How many examples do we need for a “good” \(h\)?

    • What hypothesis space should we use?

    • If the hypothesis space is very complex, can we even find the best \(h\), or do we have to settle for a local maximum in the space of hypotheses?

    • How complex should \(h\) be?

    • How do we avoid overfitting?

  • How many examples are needed for learning?

  • Computational Learning Theory: Any hypothesis that is seriously wrong will almost certainly be “found out” after a small number of examples, because it will make an incorrect prediction.

  • Probably Approximately Correct: Thus, any hypothesis that is consistent with a sufficiently large set of training examples is unlikely to be seriously wrong.

  • PAC Learning

    • Any learning algorithm that returns hypotheses that are probably approximately correct; used to provide bounds on the performance of various learning algorithms.

    • Stationary Assumption

      • Future examples are going to be drawn from the same fixed distribution \(P(E) = P(X,Y)\) as past examples

      • Assume that the true function \(f\) is deterministic and is a member of the hypothesis class \(H\) that is being considered.

  • Boolean Function PACs

    • Error rate of hypothesis \(h\) is the expected generalization error for examples drawn from the stationary distribution:
      error(\(h\))\(=GenLoss_{L_{0/1}}(h) = \sum\limits_{x,y}L_{0/1}(y,h(x))\,P(x,y).\)
      In other words, error(\(h\)) is the probability that \(h\) misclassifies a new example.

    • A hypothesis \(h\) is approximately correct if error(\(h\)) \(\leq \epsilon\), where \(\epsilon\) is a small enough constant.

    • We can calculate the probability that a “seriously wrong” hypothesis \(h_b \in H_{bad}\) is consistent with the first \(N\) examples.
      We know that error(\(h_b\)) \(> \epsilon\).
      Thus the probability that it agrees with any given example is at most \(1-\epsilon\), and since the examples are drawn independently,
      \(P(h_b\) agrees with \(N\) examples) \(\leq (1-\epsilon)^N.\)

    • Probability that \(H_{bad}\) contains at least one consistent hypothesis is bounded by the sum of the individual probabilities:
      \(P(H_{bad}\) contains a consistent hypothesis) \(\leq |H_{bad}|(1-\epsilon)^N \leq \, |H|(1-\epsilon)^N\)

    • If \(|H|(1-\epsilon)^N \leq \delta\), then with probability at least \(1-\delta\) any consistent hypothesis is approximately correct. Using \(1-\epsilon \leq e^{-\epsilon}\), this holds whenever \(N \geq \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)\) — the sample complexity of the hypothesis space.

    • \(|H| = 2^{2^n}\) if \(H\) is the set of all Boolean functions on \(n\) attributes, so the sample complexity grows as \(O(2^n)\) — on the order of the number of possible examples. PAC-learning the class of all Boolean functions requires seeing essentially all of them.
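The sample-complexity bound above can be checked numerically. A minimal sketch (the function name `pac_sample_bound` is my own; it takes \(\ln|H|\) as an argument so that huge hypothesis spaces don’t overflow):

```python
import math

def pac_sample_bound(ln_h, epsilon, delta):
    """Smallest N satisfying N >= (1/epsilon) * (ln(1/delta) + ln|H|),
    which guarantees |H| * (1 - epsilon)^N <= delta.
    ln_h is ln|H|, passed as a log to handle very large hypothesis spaces."""
    return math.ceil((ln_h + math.log(1.0 / delta)) / epsilon)

# For all Boolean functions on n attributes, |H| = 2^(2^n), so ln|H| = 2^n * ln 2.
n = 10
ln_H = (2 ** n) * math.log(2)
N = pac_sample_bound(ln_H, epsilon=0.1, delta=0.05)
print(N)  # → 7128; the bound grows as O(2^n), exponential in n
```

Note that the \(\ln\frac{1}{\delta}\) term is tiny next to \(\ln|H|\): almost all of the sample complexity comes from the size of the hypothesis space, which is why the notes turn next to restricted spaces like decision lists.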

  • Decision Lists

    • Idea: Focus on learnable subsets of the entire hypothesis space of Boolean functions.

    • A decision list consists of a series of tests, each of which is a conjunction of literals.

      • If a test succeeds when applied to an example description, the decision list specifies the value to be returned.

      • If the test fails, processing continues with the next test in the list.

      • Lists resemble decision trees, but they only branch in one direction

    • \(k\)-DL: each test is a conjunction of at most \(k\) literals.

    • Polynomial sample complexity: \(\ln|k\text{-DL}(n)| = O(n^k \log_2(n^k))\), so any algorithm that returns a consistent decision list PAC-learns \(k\)-DL from a polynomial number of examples.

  • Learning decision lists: Greedily select a test that matches a nonempty subset of the remaining examples, all of which have the same classification; append (test, classification) to the list, remove those examples, and repeat until none remain.
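The greedy procedure above can be sketched as follows. This is a minimal, brute-force version under my own conventions: attributes are Boolean, examples are `(dict, label)` pairs, a test is a dict of required attribute values, and the helper name `find_homogeneous_test` is illustrative.

```python
from itertools import combinations, product

def find_homogeneous_test(examples, attributes, k):
    """Return (test, label) for some conjunction of at most k literals whose
    matching examples are nonempty and all share one label, else None."""
    for size in range(k + 1):
        for attrs in combinations(attributes, size):
            for values in product([True, False], repeat=size):
                test = dict(zip(attrs, values))
                matched = [y for x, y in examples
                           if all(x[a] == v for a, v in test.items())]
                if matched and len(set(matched)) == 1:
                    return test, matched[0]
    return None

def learn_decision_list(examples, attributes, k):
    """Greedy k-DL learner: repeatedly peel off a homogeneous subset."""
    examples = list(examples)
    rules = []
    while examples:
        hit = find_homogeneous_test(examples, attributes, k)
        if hit is None:
            return None  # no homogeneous test: target not representable as a k-DL
        test, label = hit
        rules.append((test, label))
        # Remove the examples covered by this test and continue with the rest.
        examples = [(x, y) for x, y in examples
                    if not all(x[a] == v for a, v in test.items())]
    return rules

def predict(rules, x):
    """Evaluate a decision list: the first matching test determines the label."""
    for test, label in rules:
        if all(x[a] == v for a, v in test.items()):
            return label

# Example: learn A AND B as a 1-decision-list over two Boolean attributes.
data = [({'A': a, 'B': b}, a and b)
        for a in (True, False) for b in (True, False)]
rules = learn_decision_list(data, ['A', 'B'], k=1)
```

The exhaustive search over tests is exponential in \(k\) but polynomial in \(n\) for fixed \(k\), which matches the polynomial sample-complexity claim above; a practical learner would pick among matching tests more cleverly (e.g., preferring small tests that cover many examples).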