EECS 492 Chapter 18: Learning From Examples

Introduction

  • An agent is learning if it improves its performance on future tasks after making observations about the world.

  • This chapter: From a collection of input-output pairs, learn a function that predicts the output for new inputs.

Supervised Learning

  • Task
    Given a training set of \(N\) example input-output pairs
    \((x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\),
    where each \(y_j\) was generated by an unknown function \(y = f(x)\), discover a function \(h\) that approximates the true function \(f\).

  • Function \(h\): Hypothesis.

    • Learning is a search through space of possible hypotheses for one that performs well, even on new examples.

    • To measure accuracy of a hypothesis, we use a test set of examples distinct from the training set.

    • A hypothesis generalizes well if it correctly predicts the value of \(y\) for novel examples.

    • Sometimes \(f\) is stochastic (not strictly a function of \(x\)), which means that we have to learn a conditional probability distribution \(P(Y\mid x)\).

  • Types of Learning Problems

    • Classification: Type of learning problem for which the output \(y\) is one of a finite set of values (such as \(sunny\), \(cloudy\), or \(rainy\)).

    • Regression: Type of learning problem for which the output \(y\) is a number (such as temperature).

  • Ockham’s Razor: Choose the simplest hypothesis consistent with the data.

  • “In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.”

  • Supervised learning can be done by choosing the hypothesis \(h^*\) that is most probable given the data:
    \(h^*=\operatorname*{argmax}_{h\in H}\,P(h\mid data).\)
    By Bayes' rule: \(h^*=\operatorname*{argmax}_{h\in H}\,P(data\mid h)\,P(h).\)
    The prior \(P(h)\) is high for a degree-1 or degree-2 polynomial and low for a higher-degree polynomial, so it encodes a preference for simpler hypotheses (see the sketch below).
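To make the tradeoff concrete, here is a minimal sketch (assuming NumPy is available; the underlying function, noise level, and polynomial degrees are invented purely for illustration). It fits polynomial hypotheses of several degrees to 12 noisy training points and compares training error with error on held-out test points.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical data: y = f(x) + noise, with f unknown to the learner
# (here a sine curve, chosen only so the example is self-contained).
x_train = np.sort(rng.uniform(0.0, 6.0, 12))
y_train = np.sin(x_train) + rng.normal(0.0, 0.1, x_train.size)
x_test = np.sort(rng.uniform(0.0, 6.0, 100))
y_test = np.sin(x_test) + rng.normal(0.0, 0.1, x_test.size)

for degree in (1, 3, 11):
    # Hypothesis h: the degree-d polynomial that best fits the training pairs.
    h = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((h(x_train) - y_train) ** 2)
    test_mse = np.mean((h(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

Typically the degree-11 hypothesis drives its training error toward zero while its test error is the worst of the three (exact numbers depend on the seed), which is the overfitting behavior that Ockham's razor guards against.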

Learning Decision Trees


Decision Tree Representation

  • A decision tree represents a function that takes as input a vector of attribute values and returns a “decision” (a single output value).

  • For problems with discrete (Boolean) outputs, each example input is classified as true (a positive example) or false (a negative example).

  • Each node corresponds to a test of the value of one of the input attributes, \(A_i\), and the branches are labeled with possible values for that attribute, \(A_i = v_{ik}\).

  • Each leaf in the tree specifies a value returned by the function.


Expressiveness of Decision Trees

  • A boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and only if the input attributes satisfy one of the paths leading to a leaf with value \(true\).
    \(Goal \Leftrightarrow (Path_1 \vee Path_2 \vee ...). \)
    where each path is a conjunction of attribute-value tests, e.g.:
    \(Path = (Patrons = Full \wedge WaitEstimate = 0-10).\)
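As a concrete (and hedged) illustration of this equivalence, the following sketch encodes a two-attribute fragment of a WillWait-style tree. The attribute names Patrons and WaitEstimate come from the path above; the exact branch values are assumed for illustration.

```python
def will_wait(patrons: str, wait_estimate: str) -> bool:
    """A tiny Boolean decision tree: each internal node tests one attribute,
    and each leaf returns the decision."""
    if patrons == "None":          # root test on Patrons
        return False
    if patrons == "Some":
        return True
    # Patrons = Full: test WaitEstimate on this branch
    return wait_estimate == "0-10"

# Logically equivalent path-disjunction form:
#   Goal <=> (Patrons = Some) v (Patrons = Full ^ WaitEstimate = 0-10)
assert will_wait("Full", "0-10") and not will_wait("Full", "30-60")
```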


Inducing Decision Trees from Examples


Each example is a pair \((x, y)\):

  • x (input): vector of values of input attributes

  • y (output): single Boolean output value

There is no efficient way to find the smallest decision tree consistent with the training set: with \(n\) Boolean attributes there are \(2^n\) distinct input vectors and hence \(2^{2^n}\) distinct Boolean functions (and at least that many trees) to search over; see the check below.
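A quick sanity check of this count (a sketch; brute-force enumeration is only feasible for very small \(n\)):

```python
from itertools import product

def count_boolean_functions(n: int) -> int:
    """Count distinct Boolean functions of n Boolean attributes by brute force."""
    rows = list(product([False, True], repeat=n))            # 2^n distinct inputs
    tables = set(product([False, True], repeat=len(rows)))   # one output column per function
    return len(tables)

for n in (1, 2, 3):
    print(n, count_boolean_functions(n), 2 ** (2 ** n))      # counts agree: 4, 16, 256
```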

  • Decision Tree Learning Algorithm

    • Idea

      • Greedy divide and conquer algorithm.

      • Always test the most important attribute first, i.e., the attribute that makes the most difference to the classification of an example, so as to reduce the number of tests needed.

      • Divide into smaller subproblems that can then be solved recursively.

  • In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem in itself, with fewer examples and one fewer attribute.

  • Recursive Considerations

    • If remaining examples are all positive/negative, then we are done (can answer \(Yes\) or \(No\)).

    • If there are some positive and some negative examples, then choose the best attribute to split them on.

    • If there are no examples left, it means that no example has been observed for this combination of attribute values; return a default value calculated from the plurality of classifications of the examples used in constructing the node's parent (passed in as \(parent\_examples\)).

    • If there are no attributes left but both positive and negative examples remain, it means these examples have exactly the same description but different classifications; return the plurality classification of the remaining examples.

  • Algorithm: see the sketch after this list.

  • Danger: Over-interpreting the tree that the algorithm selects.

  • Evaluate accuracy with a learning curve: as the training set grows, accuracy on held-out test data increases.
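The sketch promised above: a minimal Python version of the greedy, recursive procedure following the cases listed under "Recursive Considerations." The data layout (examples as (attribute-dict, label) pairs) is an assumption, and IMPORTANCE is implemented here as information gain, one standard choice; neither detail is specified in the notes themselves.

```python
import math
from collections import Counter

def plurality_value(examples):
    """Most common output value among the examples (ties broken arbitrarily)."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def entropy(examples):
    counts = Counter(y for _, y in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def importance(attribute, examples):
    """Information gain from testing this attribute (one common IMPORTANCE choice)."""
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def decision_tree_learning(examples, attributes, parent_examples):
    if not examples:                            # no examples left for this branch
        return plurality_value(parent_examples)
    if len({y for _, y in examples}) == 1:      # all positive or all negative
        return examples[0][1]
    if not attributes:                          # attributes exhausted, labels still mixed
        return plurality_value(examples)
    best = max(attributes, key=lambda a: importance(a, examples))
    tree = {}                                   # branch value -> subtree (or leaf label)
    for value in {x[best] for x, _ in examples}:   # branches only for observed values
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attributes if a != best]
        tree[value] = decision_tree_learning(subset, rest, examples)
    return (best, tree)
```

Classifying a new input then just follows the stored attribute tests from the root node down to a leaf label.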

Ensemble Learning (AIMA §18.10, p. 748)

  • So far: Looked at learning methods in which a single hypothesis, chosen from a hypothesis space, is used to make predictions.

  • Ensemble Learning: Select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.

  • Motivation: Suppose we have an ensemble of \(K=5\) hypotheses and we combine their predictions by simple majority voting. For the ensemble to misclassify a new example, at least three of the five hypotheses have to misclassify it, so the chance of misclassification is lower than with a single hypothesis.

    • Suppose each hypothesis \(h_k\) in the ensemble has an error of \(p\), i.e., the probability that a randomly chosen example is misclassified by \(h_k\) is \(p\).

    • Suppose that the errors made by each hypothesis are independent; under these assumptions the ensemble error is a binomial tail probability (see the sketch below).
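A quick check of this argument under the two assumptions above; the per-hypothesis error \(p = 0.1\) is just an illustrative number, not a value from the notes.

```python
from math import comb

def majority_error(K: int, p: float) -> float:
    """P(a majority of K independent hypotheses, each with error p, misclassify an example)."""
    need = K // 2 + 1   # e.g. at least 3 of 5
    return sum(comb(K, k) * p**k * (1 - p) ** (K - k) for k in range(need, K + 1))

print(majority_error(5, 0.1))   # ~0.0086, far below the single-hypothesis error of 0.1
```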


Boosting

  • Weighted Training Set

    • Each example has an associated weight \(w_j \geq 0\).

    • The higher the weight, the higher the importance attached to it during the learning of a hypothesis.

  • Boosting

    • Boosting starts with \(w_j=1\) for all examples (a normal training set). From this set it generates the first hypothesis, \(h_1\).

    • Increase the weights for the misclassified examples and decrease the weights for the correctly classified examples to generate \(h_2\).

    • Continue this process until \(K\) hypotheses are generated (where \(K\) is an input to the algorithm).

    • The final ensemble hypothesis is a weighted-majority combination of all \(K\) hypotheses, each weighted according to how well it performed on the training set.

  • ADABOOST

    • If the input learning algorithm \(L\) is a weak learning algorithm (\(L\) always returns a hypothesis whose accuracy is slightly better than random guessing, i.e., \(50\% + \epsilon\)), then ADABOOST will return a hypothesis that classifies the training data perfectly for large enough \(K\).

  • Decision stumps: decision trees with just one test at the root, often used as the original hypothesis space for boosting (as in the sketch below).
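To tie the pieces together, here is a hedged sketch of the boosting loop described above, instantiated as ADABOOST with decision stumps as the weak hypothesis space. The weight-update and hypothesis-weight formulas follow the standard ADABOOST recipe; the data layout (a list of (attribute-dict, ±1 label) pairs) and the stump learner are assumptions made for illustration.

```python
import math

def learn_stump(examples, weights):
    """Weak learner: the single attribute-value test with lowest weighted error."""
    best = None
    for attr in examples[0][0]:
        for value in {x[attr] for x, _ in examples}:
            for sign in (+1, -1):
                h = lambda x, a=attr, v=value, s=sign: s if x[a] == v else -s
                err = sum(w for (x, y), w in zip(examples, weights) if h(x) != y)
                if best is None or err < best[0]:
                    best = (err, h)
    return best[1]

def adaboost(examples, K):
    """Return a weighted-majority ensemble of K stumps (labels are +1 / -1)."""
    N = len(examples)
    w = [1.0 / N] * N   # uniform initial weights (the notes' w_j = 1, up to normalization)
    hypotheses, z = [], []
    for _ in range(K):
        h = learn_stump(examples, w)
        error = sum(wj for (x, y), wj in zip(examples, w) if h(x) != y)
        error = min(max(error, 1e-10), 1 - 1e-10)        # avoid division by zero
        for j, (x, y) in enumerate(examples):
            if h(x) == y:                                # shrink weights of correct examples
                w[j] *= error / (1 - error)
        total = sum(w)
        w = [wj / total for wj in w]                     # renormalize
        hypotheses.append(h)
        z.append(math.log((1 - error) / error))          # hypothesis weight
    def ensemble(x):
        return 1 if sum(zk * hk(x) for zk, hk in zip(z, hypotheses)) > 0 else -1
    return ensemble
```

With a weak learner whose weighted error stays below \(1/2\), every \(z_k\) is positive, and for large enough \(K\) the weighted majority fits the training data perfectly, as stated above.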