EECS 492 Chapter 18: Learning From Examples


  • An agent is learning if it improves its performance on future tasks after making observations about the world.

  • This chapter: From a collection of input-output pairs, learn a function that predicts the output for new inputs.

Supervised Learning

  • Task
    Given a training set of \(N\) example input-output pairs
    \((x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\),
    where each \(y_j\) was generated by an unknown function \(y = f(x)\), discover a function \(h\) that approximates the true function \(f\).

  • Function \(h\): Hypothesis.

    • Learning is a search through space of possible hypotheses for one that performs well, even on new examples.

    • To measure accuracy of a hypothesis, we use a test set of examples distinct from the training set.

    • A hypothesis generalizes if it correctly predicts the value of \(y\) for novel examples (a minimal sketch of this train/test setup follows this list).

    • Sometimes \(f\) is stochastic (not strictly a function of \(x\)), which means that we have to learn a conditional probability distribution \(P(Y|x)\).
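
A minimal sketch of this setup. The synthetic data, the degree-1 polynomial hypothesis, and the 80/20 split are illustrative assumptions, not anything specified in the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown "true" function f, observed with noise (so f is effectively stochastic).
def f(x):
    return 2.0 * x + 1.0

x = rng.uniform(0, 10, size=100)
y = f(x) + rng.normal(0.0, 1.0, size=100)

# Training set, plus a test set distinct from the training set.
x_train, y_train = x[:80], y[:80]
x_test, y_test = x[80:], y[80:]

# Hypothesis h: a degree-1 polynomial fit to the training examples.
h = np.poly1d(np.polyfit(x_train, y_train, deg=1))

# Generalization is measured on examples the learner never saw.
test_mse = np.mean((h(x_test) - y_test) ** 2)
print(f"test MSE: {test_mse:.3f}")
```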

  • Types of Learning Problems

    • Classification: Type of learning problem for which the output \(y\) is one of a finite set of values (such as \(sunny\), \(cloudy\), or \(rainy\)).

    • Regression: Type of learning problem for which the output \(y\) is a number (such as temperature).

  • Ockham’s Razor: Choose the simplest hypothesis consistent with the data.

  • “In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.”

  • Supervised learning can be done by choosing the hypothesis \(h^*\) that is most probable given the data:
    \(h^* = \arg\max_{h\in H} P(h \mid data)\)
    By Bayes' rule: \(h^* = \arg\max_{h\in H} P(data \mid h)\,P(h)\)
    The prior \(P(h)\) is high for a low-degree (e.g., degree-1 or degree-2) polynomial and low for a higher-degree polynomial, as sketched below.
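
A small sketch of this selection rule over polynomial hypotheses. The Gaussian noise model, the prior that decays with degree, and the synthetic data are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=30)
y = 1.0 - 2.0 * x + rng.normal(0.0, 0.3, size=30)  # true f: degree 1 plus noise

def log_likelihood(h, x, y, sigma=0.3):
    # log P(data | h), assuming Gaussian noise with known sigma.
    resid = y - h(x)
    return np.sum(-0.5 * (resid / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi)))

def log_prior(degree, lam=2.0):
    # log P(h): assumed to decay with polynomial degree (Ockham's razor).
    return -lam * degree

# h* = argmax_h P(data | h) P(h), computed in log space over degrees 0..9.
scores = {}
for d in range(10):
    h = np.poly1d(np.polyfit(x, y, deg=d))
    scores[d] = log_likelihood(h, x, y) + log_prior(d)

print("chosen degree:", max(scores, key=scores.get))
```

With the prior term removed, the highest degree tends to win on likelihood alone; the prior is what trades fit to the training data against simplicity.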


Ensemble Learning 18.10 p748

  • So far: Looked at learning methods in which a single hypothesis, chosen from a hypothesis space, is used to make predictions

  • Ensemble Learning: Select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.

  • Motivation: Suppose we have an ensemble of \(K=5\) hypotheses and we combine their predictions by simple majority voting. For the ensemble to misclassify a new example, at least three of the five hypotheses have to misclassify it, so the chance of misclassification is lower than with a single hypothesis (quantified in the sketch after this list).

    • Suppose each hypothesis \(h_k\) in the ensemble has an error of \(p\) (the probability that a randomly chosen example is misclassified by \(h_k\) is \(p\)).

    • Suppose that the errors made by each hypothesis are independent.
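
Under these assumptions, the ensemble errs only when a majority of the \(K\) hypotheses err, which is a binomial tail probability. A quick check of the \(K=5\) case:

```python
from math import comb

K, p = 5, 0.1  # five independent hypotheses, each wrong with probability 0.1

# Majority voting errs only if at least 3 of the 5 hypotheses err:
# P(X >= 3) where X ~ Binomial(K, p).
ensemble_error = sum(comb(K, k) * p**k * (1 - p) ** (K - k) for k in range(3, K + 1))
print(f"{ensemble_error:.4f}")  # about 0.0086, versus 0.1 for a single hypothesis
```

The gain leans heavily on the independence assumption; in practice the hypotheses' errors are correlated, so the improvement is smaller.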


  • Weighted Training Set

    • Each example has an associated weight \(w_j \geq 0\).

    • The higher the weight, the higher the importance attached to it during the learning of a hypothesis.

  • Boosting

    • Begin with \(w_j = 1\) for all examples and generate the first hypothesis, \(h_1\), from this set.

    • Increase the weights for the misclassified examples and decrease the weights for the correctly classified examples to generate \(h_2\).

    • Continue this process until \(K\) hypotheses are generated (where \(K\) is an input to the algorithm).

    • The final ensemble hypothesis is a weighted-majority combination of all \(K\) hypotheses, each weighted according to how well it performed on the training set.


    • If the input learning algorithm \(L\) is a weak learning algorithm (\(L\) always returns a hypothesis with accuracy slightly better than random guessing, i.e., \(50\% + \epsilon\)), then ADABOOST will return a hypothesis that classifies the training data perfectly for a large enough \(K\).

  • Decision stumps: a hypothesis space of decision trees with just one test, at the root; a common weak hypothesis space for boosting (a compact sketch using stumps follows).
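
A compact sketch of boosting with decision stumps on a toy 1-D dataset. The weight-update constants follow the standard AdaBoost recipe; the dataset and \(K\) are made up for illustration:

```python
import numpy as np

def fit_stump(x, y, w):
    # Decision stump: a single threshold test on the 1-D input.
    best = None
    for thr in np.unique(x):
        for sign in (1, -1):
            pred = np.where(x <= thr, sign, -sign)
            err = w[pred != y].sum()  # weighted training error
            if best is None or err < best[0]:
                best = (err, thr, sign)
    err, thr, sign = best
    return err, (lambda x, t=thr, s=sign: np.where(x <= t, s, -s))

def adaboost(x, y, K):
    w = np.full(len(x), 1.0 / len(x))  # equal weights to start (w_j = 1, normalized)
    hyps, alphas = [], []
    for _ in range(K):
        err, h = fit_stump(x, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # hypothesis weight
        w = w * np.exp(-alpha * y * h(x))  # raise misclassified, lower the rest
        w /= w.sum()
        hyps.append(h)
        alphas.append(alpha)
    # Weighted-majority combination of the K hypotheses.
    return lambda x: np.sign(sum(a * h(x) for a, h in zip(alphas, hyps)))

x = np.arange(8.0)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1])  # not separable by any single stump
H = adaboost(x, y, K=30)
print("training accuracy:", (H(x) == y).mean())  # reaches 1.0 for large enough K
```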

Theory of Learning 18.5 p713

  • How can we be sure that our learning algorithm has produced a hypothesis that will predict the correct value for previously unseen inputs?
    Formally: how do we know the hypothesis \(h\) is close enough to the target function \(f\) if we don't know what \(f\) is?

  • Questions

    • How many examples do we need for a “good” \(h\)?

    • What hypothesis space should we use?

    • If the hypothesis space is very complex, can we even find the best \(h\), or do we have to settle for a local maximum in the space of hypotheses?

    • How complex should \(h\) be?

    • How do we avoid overfitting?

  • How many examples are needed for learning?

  • Computational Learning Theory: Any hypothesis that is seriously wrong will, with high probability, be “found out” after a small number of examples, because it will make an incorrect prediction. Hence any hypothesis consistent with a sufficiently large training set is unlikely to be seriously wrong: it is probably approximately correct (PAC).
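
    • One standard result of this form is the PAC sample-complexity bound for a finite hypothesis space \(H\) (quoted here from memory of this section, so double-check the constants against the text): to guarantee that, with probability at least \(1-\delta\), every hypothesis consistent with the training examples has error at most \(\epsilon\), it suffices to see
      \(N \geq \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right)\)
      examples, where \(|H|\) is the number of hypotheses in the hypothesis space.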