EECS 492 Chapter 18: Learning From Examples

An agent is

**learning**if it improves its performance on future tasks after making observations about the world.This chapter: From a collection of input-output pairs, learn a function that predicts the output for new inputs.

Task

Given a**training set**of N example input-output pairs.

\((x_1, y_1), (x_2, y_2), ... (x_N, y_N)\)

where each \(y_j\) was generated by an unkown function \(y = f(x)\), discover a function \(h\) that approximates the true function \(f\).Function \(h\): Hypothesis.

Learning is a search through space of possible hypotheses for one that performs well, even on new examples.

To measure accuracy of a hypothesis, we use a

**test set**of examples distinct from the training set.A hypothesis

**generalizes**if it correctly predits the value of \(y\) for novel examples.Sometimes \(f\) is stochastic-not strictly a function of \(x\)- which means that we have to learn a conditional probability distribution \(P(Y|x)\)

Types of Learning Problems

Classification: Type of learning problem for which the output \(y\) is one of a finite set of values (such as \(sunny\), \(cloudy\), or \(rainy\)).

Regression: Type of learning problem for which the output \(y\) is a number (such as temperature).

**Ockham’s Razor**: Choose the*simplest*hypothesis consistent with the data.“In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.”

Supervised learning is done by choosing the hypothesis \(h^*\) that is most probably given the data:

\(h^*=argmax_{h\in H}\,P(h|data).\)

By Bayes: \(h^*=argmax_{h\in H}\,P(data|h)P(h).\)

\(P(h)\) is high for a degree 1/2 polynomial and low for a higher degree polynomial.

A

**decision tree**represents a function that takes as input a vector of attribute values and returns a “decision”–a single output value.Problems with discrete values: Each example input will be classified as true (

**positive**example) or false (**negative**example).Each node corresponds to a test of the value of one of the input attributes, \(A_i\), and the branches are labeled with possible values for that attribute, \(A_i = v_{ik}\).

Each leaf in the tree specifies a value returned by the function.

A boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and only if the input attributes satisfy one of the paths leading to a leaf with value \(true\).

\(Goal \Leftrightarrow (Path_1 \vee Path_2 \vee ...). \)

Where each path is also a propositional logic statement:

\(Path = (Patrons = Full \wedge WaitEstimate = 0-10).\)

(x,y)

x (input): vector of values of input attributes

y (output): single Boolean output value

No way to find smallest consistent tree with all training set values–\(2^{2^n}\) trees (2^n distinct input attributes, 2^n output values).

Decision Tree Learning Algorithm

Idea

Greedy divide and conquer algorithm.

Always test the more “important attribute” first. In other words, the attribute that makes the most difference to the classification of the sample to reduce number of tests.

Divide into smaller subproblems that can then be solved recursively.

In general, after the first attribute test splits up the examples, each outcomei s a new decision tree learning problem, with fewer examples and one less attribute

Recursive Considerations

If remaining examples are all positive/negative, then we are done (can answer \(Yes\) or \(No\)).

If there are some postiive/negative examples, then choose best attribute to split them.

If there are no examples left, it means that no example has been observed for this combination of attribute values, and we return a default value calculated from the plurality of classifications of the examples that were used in constructing the node’s parent (passed through \(parent_examples\)).

If there are no attributes left, but positive and negative examples, it means that these examples have exactly the same description but different classifications. Return the plurality classification of remaining examples.

Algorithm

Danger: Over-interpreting the tree that the algorithm selects.

Evaluate accuracy with a

**learning curve**. As training set grows, the accuracy increases.

So far: Looked at learning methods in which a single hypothesis, chosen from a hypothesis space, is used to make predictions

**Ensemble Learning**: Select a collection, or**ensemble**

## Share on Social Media