EECS 492 Chapter 18: Learning From Examples

An agent is **learning** if it improves its performance on future tasks after making observations about the world.

This chapter: From a collection of input-output pairs, learn a function that predicts the output for new inputs.

Task

Given a **training set** of N example input-output pairs:

\((x_1, y_1), (x_2, y_2), ..., (x_N, y_N)\)

where each \(y_j\) was generated by an unknown function \(y = f(x)\), discover a function \(h\) that approximates the true function \(f\). The function \(h\) is called a **hypothesis**.

Learning is a search through the space of possible hypotheses for one that performs well, even on new examples.

To measure the accuracy of a hypothesis, we use a **test set** of examples distinct from the training set.

A hypothesis **generalizes** if it correctly predicts the value of \(y\) for novel examples.

Sometimes \(f\) is stochastic (not strictly a function of \(x\)), which means that we have to learn a conditional probability distribution \(P(Y|x)\).

Types of Learning Problems

Classification: Type of learning problem for which the output \(y\) is one of a finite set of values (such as \(sunny\), \(cloudy\), or \(rainy\)).

Regression: Type of learning problem for which the output \(y\) is a number (such as temperature).

**Ockham’s Razor**: Choose the *simplest* hypothesis consistent with the data. “In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.”

Supervised learning is done by choosing the hypothesis \(h^*\) that is most probable given the data:

\(h^*=argmax_{h\in H}\,P(h|data).\)

By Bayes: \(h^*=argmax_{h\in H}\,P(data|h)P(h).\)

\(P(h)\) is high for a degree-1 or degree-2 polynomial and low for a higher-degree polynomial.
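The MAP choice \(h^* = argmax_{h \in H}\,P(data|h)P(h)\) can be sketched concretely. In this toy example the hypothesis names, priors, and likelihoods are all made up for illustration: the complex hypothesis fits the data slightly better, but the prior penalizes it.

```python
# Toy illustration of h* = argmax_h P(data|h) P(h).
# All numbers are hypothetical: priors favor simpler polynomials,
# and likelihoods are assumed values for some fixed data set.
hypotheses = {
    "degree-1": {"prior": 0.5, "likelihood": 0.10},
    "degree-2": {"prior": 0.3, "likelihood": 0.20},
    "degree-7": {"prior": 0.2, "likelihood": 0.25},  # fits best, but low prior
}

def map_hypothesis(hypotheses):
    """Return the hypothesis maximizing P(data|h) * P(h)."""
    return max(hypotheses,
               key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

print(map_hypothesis(hypotheses))  # → degree-2
```

Here the degree-2 hypothesis wins (\(0.3 \times 0.20 = 0.06\)) even though the degree-7 hypothesis has the highest likelihood, because its prior is low.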

A **decision tree** represents a function that takes as input a vector of attribute values and returns a “decision”–a single output value.

For problems with discrete values, each example input will be classified as true (a **positive** example) or false (a **negative** example).

Each node corresponds to a test of the value of one of the input attributes, \(A_i\), and the branches are labeled with the possible values of that attribute, \(A_i = v_{ik}\).

Each leaf in the tree specifies a value returned by the function.

A boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and only if the input attributes satisfy one of the paths leading to a leaf with value \(true\).

\(Goal \Leftrightarrow (Path_1 \vee Path_2 \vee ...). \)

Where each path is also a propositional logic statement:

\(Path = (Patrons = Full \wedge WaitEstimate = 0-10).\)
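This path-disjunction reading can be mirrored directly in code. A small sketch, with a hypothetical two-path tree; the attribute names follow the restaurant example above:

```python
# Each path is a conjunction of attribute tests; the goal is their disjunction.
# Hypothetical paths for illustration only.
def path1(ex):
    return ex["Patrons"] == "Full" and ex["WaitEstimate"] == "0-10"

def path2(ex):
    return ex["Patrons"] == "Some"

def goal(ex):
    # Goal <=> (Path_1 v Path_2)
    return path1(ex) or path2(ex)

print(goal({"Patrons": "Some", "WaitEstimate": "30-60"}))  # → True
print(goal({"Patrons": "Full", "WaitEstimate": "30-60"}))  # → False
```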

Each training example is a pair \((x, y)\):

x (input): vector of values of the input attributes

y (output): a single Boolean output value

There is no tractable way to find the smallest tree consistent with all of the training examples: with \(n\) Boolean attributes there are \(2^n\) distinct inputs, and therefore \(2^{2^n}\) distinct Boolean functions (and at least that many trees).
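The \(2^{2^n}\) count can be checked directly: a truth table over \(n\) Boolean attributes has \(2^n\) rows, and each row can independently be labeled true or false.

```python
def num_boolean_functions(n):
    """Number of distinct Boolean functions of n Boolean attributes:
    the truth table has 2**n rows, each independently True or False,
    giving 2**(2**n) functions."""
    return 2 ** (2 ** n)

# Doubly-exponential growth: even modest n is hopeless to enumerate.
for n in range(1, 5):
    print(n, num_boolean_functions(n))  # 1→4, 2→16, 3→256, 4→65536
```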

Decision Tree Learning Algorithm

Idea

Greedy divide-and-conquer algorithm.

Always test the most “important” attribute first: the attribute that makes the most difference to the classification of an example, so as to reduce the number of tests needed.

Divide into smaller subproblems that can then be solved recursively.

In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem, with fewer examples and one fewer attribute.

Recursive Considerations

If the remaining examples are all positive (or all negative), then we are done (can answer \(Yes\) or \(No\)).

If there are both positive and negative examples, then choose the best attribute to split them.

If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value: the plurality of the classifications of the examples that were used in constructing the node’s parent (passed in as \(parent\_examples\)).

If there are no attributes left but both positive and negative examples remain, it means that these examples have exactly the same description but different classifications. Return the plurality classification of the remaining examples.

Algorithm
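A minimal Python sketch of the recursive procedure described above. This is an illustrative reconstruction, not the book's pseudocode: examples are (attribute-dict, label) pairs, and `importance` uses a simple error-count score as a stand-in for the information-gain measure usually used to pick the best attribute.

```python
from collections import Counter

def plurality_value(examples):
    """Most common classification among examples (ties broken arbitrarily)."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def importance(attr, examples):
    """Crude importance score: number of examples classified correctly if
    each branch of the split predicts its own plurality class.
    (An assumed proxy for information gain.)"""
    branches = {}
    for x, label in examples:
        branches.setdefault(x[attr], []).append(label)
    return sum(Counter(labels).most_common(1)[0][1] for labels in branches.values())

def dtl(examples, attributes, parent_examples):
    """Decision-tree learning, mirroring the four recursive cases above."""
    if not examples:                      # no examples left
        return plurality_value(parent_examples)
    labels = {label for _, label in examples}
    if len(labels) == 1:                  # all positive or all negative
        return labels.pop()
    if not attributes:                    # attributes exhausted
        return plurality_value(examples)
    best = max(attributes, key=lambda a: importance(a, examples))
    tree = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attributes if a != best]
        tree[value] = dtl(subset, rest, examples)
    return (best, tree)

# Tiny hypothetical restaurant-style data set.
data = [
    ({"Patrons": "Full", "Hungry": "Yes"}, True),
    ({"Patrons": "Full", "Hungry": "No"}, False),
    ({"Patrons": "None", "Hungry": "Yes"}, False),
    ({"Patrons": "Some", "Hungry": "No"}, True),
]
tree = dtl(data, ["Patrons", "Hungry"], data)
print(tree)  # ('Patrons', {...}): tests Patrons first, then Hungry if Full
```

A real implementation would also handle attribute values not seen in a subset at prediction time, e.g. by falling back to the plurality value.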

Danger: Over-interpreting the tree that the algorithm selects.

Evaluate accuracy with a **learning curve**: as the training set grows, the accuracy increases.

So far: Looked at learning methods in which a single hypothesis, chosen from a hypothesis space, is used to make predictions.

**Ensemble Learning**: Select a collection, or **ensemble**, of hypotheses from the hypothesis space and combine their predictions.

Motivation: Suppose we have an ensemble of K=5 hypotheses and we combine their predictions using simple majority voting. For the ensemble to misclassify a new example, *at least three of the five hypotheses have to misclassify it*. That is, there is a lower chance of misclassification than with a single hypothesis.

Suppose each hypothesis \(h_k\) in the ensemble has an error of \(p\) (the probability that a randomly chosen example is misclassified by \(h_k\) is \(p\)).

Suppose also that the errors made by each hypothesis are *independent*.
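Under these assumptions, the ensemble's error is the probability that a strict majority of the \(K\) hypotheses err on the same example: a binomial tail that can be computed directly. A small sketch (the numbers K=5, p=0.1 are illustrative):

```python
from math import comb

def ensemble_error(K, p):
    """Probability that a strict majority of K independent hypotheses,
    each with individual error p, misclassify the same example."""
    need = K // 2 + 1  # smallest number of wrong votes that flips the majority
    return sum(comb(K, k) * p**k * (1 - p)**(K - k) for k in range(need, K + 1))

# K = 5, p = 0.1: at least 3 of the 5 must err.
print(ensemble_error(5, 0.1))  # ≈ 0.0086, far below the single-hypothesis error 0.1
```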

**Weighted Training Set**: Each example has an associated weight \(w_j \geq 0\).

The higher the weight, the higher the importance attached to it during the learning of a hypothesis.

Boosting

Boosting begins with \(w_j=1\) for all examples. From this training set, it generates the first hypothesis, \(h_1\).

It then increases the weights of the misclassified examples and decreases the weights of the correctly classified examples, and uses the reweighted set to generate \(h_2\).

Continue this process until \(K\) hypotheses are generated (where \(K\) is an input to the algorithm).

The final ensemble hypothesis is a weighted-majority combination of all \(K\) hypotheses, each weighted according to how well it performed on the training set.

ADABOOST

If the input learning algorithm \(L\) is a **weak learning** algorithm (\(L\) always returns a hypothesis with accuracy slightly better than random guessing, i.e., \(50\,percent + \epsilon\)), then ADABOOST will return a hypothesis that classifies the training data perfectly for large enough \(K\).

**Decision stumps**: Decision trees with just one test, used as the original hypothesis space for boosting.
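The boosting loop above can be sketched with decision stumps as the weak learner. This is an illustrative reconstruction under assumptions (Boolean attributes, a stump space of single-attribute tests and their negations, the standard multiplicative weight update); the function names are made up:

```python
import math

def stump_learner(examples, weights):
    """Weak learner: pick the one-test rule (an attribute, possibly negated)
    with the lowest weighted training error. (Hypothetical sketch.)"""
    best, best_err = None, float("inf")
    n_attrs = len(examples[0][0])
    for a in range(n_attrs):
        for flip in (False, True):
            err = sum(w for (x, y), w in zip(examples, weights)
                      if (bool(x[a]) ^ flip) != y)
            if err < best_err:
                best, best_err = (a, flip), err
    a, flip = best
    return lambda x: bool(x[a]) ^ flip

def adaboost(examples, learner, K):
    """Boosting loop from the text: reweight examples toward mistakes,
    then combine the K hypotheses by weighted majority vote."""
    N = len(examples)
    w = [1.0 / N] * N
    hyps, z = [], []
    for _ in range(K):
        h = learner(examples, w)
        error = sum(wi for (x, y), wi in zip(examples, w) if h(x) != y)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against 0 or 1
        for j, (x, y) in enumerate(examples):
            if h(x) == y:
                w[j] *= error / (1 - error)        # shrink correct examples
        total = sum(w)
        w = [wi / total for wi in w]               # renormalize
        hyps.append(h)
        z.append(math.log((1 - error) / error))    # hypothesis weight
    def ensemble(x):
        vote = sum(zk if h(x) else -zk for h, zk in zip(hyps, z))
        return vote > 0
    return ensemble

# Majority-of-three target: no single stump classifies it perfectly,
# but the boosted weighted majority of stumps does.
data = [((a, b, c), a + b + c >= 2)
        for a in (0, 1) for b in (0, 1) for c in (0, 1)]
h = adaboost(data, stump_learner, K=5)
print(all(h(x) == y for x, y in data))  # → True
```

Each stump alone gets 2 of the 8 examples wrong; boosting drives the training error of the combined hypothesis to zero, as the weak-learning result above promises.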