# Introduction

• An agent is learning if it improves its performance on future tasks after making observations about the world.

• This chapter: From a collection of input-output pairs, learn a function that predicts the output for new inputs.

# Supervised Learning

Given a training set of $$N$$ example input-output pairs
$$(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N),$$
where each $$y_j$$ was generated by an unknown function $$y = f(x)$$, discover a function $$h$$ that approximates the true function $$f$$.

• Function $$h$$: Hypothesis.

• Learning is a search through the space of possible hypotheses for one that performs well, even on new examples.

• To measure accuracy of a hypothesis, we use a test set of examples distinct from the training set.

• A hypothesis generalizes if it correctly predicts the value of $$y$$ for novel examples.

• Sometimes $$f$$ is stochastic (not strictly a function of $$x$$), which means we have to learn a conditional probability distribution $$P(Y|x)$$.

• Types of Learning Problems

• Classification: Type of learning problem for which the output $$y$$ is one of a finite set of values (such as $$sunny$$, $$cloudy$$, or $$rainy$$).

• Regression: Type of learning problem for which the output $$y$$ is a number (such as temperature).

• Ockham’s Razor: Choose the simplest hypothesis consistent with the data.

• “In general, there is a tradeoff between complex hypotheses that fit the training data well and simpler hypotheses that may generalize better.”

• Supervised learning is done by choosing the hypothesis $$h^*$$ that is most probable given the data:
$$h^* = \operatorname*{argmax}_{h\in H}\,P(h|data).$$
By Bayes' rule: $$h^* = \operatorname*{argmax}_{h\in H}\,P(data|h)\,P(h).$$
$$P(h)$$ is high for a degree-1 or degree-2 polynomial and low for a higher-degree polynomial.
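To make the formula concrete, here is a minimal sketch of MAP selection over polynomial hypotheses, assuming Gaussian noise and a prior that decays with degree (the data and constants are illustrative, not from the text):

```python
# Minimal MAP hypothesis selection sketch: h* = argmax_h log P(data|h) + log P(h).
# Assumptions (illustrative): polynomial hypotheses, Gaussian noise,
# and a prior P(h) that decays exponentially with polynomial degree.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + 0.5 + rng.normal(scale=0.1, size=x.size)  # data from a degree-1 f plus noise

def log_posterior(degree, sigma=0.1, prior_decay=5.0):
    coeffs = np.polyfit(x, y, degree)                  # best fit of this degree
    residuals = y - np.polyval(coeffs, x)
    log_likelihood = -np.sum(residuals**2) / (2 * sigma**2)  # log P(data|h), up to a constant
    log_prior = -prior_decay * degree                  # P(h) high for low degree
    return log_likelihood + log_prior

h_star = max(range(1, 8), key=log_posterior)
print("MAP degree:", h_star)                           # typically 1: the simple hypothesis wins
```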

# Learning Decision Trees

## Decision Tree Representation

• A decision tree represents a function that takes as input a vector of attribute values and returns a “decision”: a single output value.

• For problems with a discrete Boolean output, each example input is classified as true (a positive example) or false (a negative example).

• Each node corresponds to a test of the value of one of the input attributes, $$A_i$$, and the branches are labeled with possible values for that attribute, $$A_i = v_{ik}$$.

• Each leaf in the tree specifies a value returned by the function.

## Expressiveness of Decision Trees

• A Boolean decision tree is logically equivalent to the assertion that the goal attribute is true if and only if the input attributes satisfy one of the paths leading to a leaf with value $$true$$.
$$Goal \Leftrightarrow (Path_1 \vee Path_2 \vee ...).$$
Where each path is also a propositional logic statement:
$$Path = (Patrons = Full \wedge WaitEstimate = 0-10).$$

## Inducing Decision Trees from Examples

An example is a pair $$(x, y)$$:

• $$x$$ (input): vector of values of the input attributes

• $$y$$ (output): single Boolean output value

There is no efficient way to find the smallest tree consistent with the training set: with $$n$$ Boolean attributes, a truth table has $$2^n$$ rows, and each row can be labeled $$true$$ or $$false$$, so there are $$2^{2^n}$$ distinct Boolean functions (and at least that many trees). For example, with $$n = 10$$ attributes there are $$2^{1024}$$, or about $$10^{308}$$, functions.

• Decision Tree Learning Algorithm

• Idea

• Greedy divide and conquer algorithm.

• Always test the most important attribute first, i.e., the attribute that makes the most difference to the classification of an example, so as to reduce the expected number of tests.

• Divide into smaller subproblems that can then be solved recursively.

• In general, after the first attribute test splits up the examples, each outcome is a new decision tree learning problem with fewer examples and one fewer attribute.

• Recursive Considerations

• If remaining examples are all positive/negative, then we are done (can answer $$Yes$$ or $$No$$).

• If there are some positive and some negative examples, then choose the best attribute to split them.

• If there are no examples left, no example has been observed for this combination of attribute values; return a default value, the plurality classification of the examples used in constructing the node's parent (passed in as $$parent\_examples$$).

• If there are no attributes left, but positive and negative examples, it means that these examples have exactly the same description but different classifications. Return the plurality classification of remaining examples.

• Algorithm: a minimal sketch is given below.
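A minimal Python sketch of the scheme the four cases above describe. The data representation (each example a dict of attribute values with a Boolean label under key "y") and the use of information gain as the importance measure are assumptions for illustration:

```python
# A minimal sketch of decision tree learning with plurality defaults and
# greedy splits by information gain. Assumed: `domains` maps each attribute
# to its set of possible values.
import math
from collections import Counter

def plurality(examples):
    """Most common classification among the examples."""
    return Counter(e["y"] for e in examples).most_common(1)[0][0]

def entropy(examples):
    n = len(examples)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(e["y"] for e in examples).values())

def importance(a, examples):
    """Information gain from testing attribute a (one common choice)."""
    remainder = 0.0
    for v in {e[a] for e in examples}:
        subset = [e for e in examples if e[a] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def decision_tree_learning(examples, attributes, parent_examples, domains):
    if not examples:                               # no examples: parent's plurality
        return plurality(parent_examples)
    if len({e["y"] for e in examples}) == 1:       # all positive or all negative
        return examples[0]["y"]
    if not attributes:                             # same description, mixed labels
        return plurality(examples)
    a = max(attributes, key=lambda a: importance(a, examples))
    tree = {a: {}}
    for v in domains[a]:                           # one subtree per value of a
        subset = [e for e in examples if e[a] == v]
        rest = [x for x in attributes if x != a]
        tree[a][v] = decision_tree_learning(subset, rest, examples, domains)
    return tree
```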

• Danger: Over-interpreting the tree that the algorithm selects.

• Evaluate accuracy with a learning curve: as the training set grows, accuracy on the test set increases.
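A quick sketch of measuring such a curve, assuming scikit-learn is available; the synthetic dataset and decision-tree model here are placeholders:

```python
# Trace a learning curve: train on increasing subsets, score on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for m in [20, 50, 100, 200, 400, len(X_train)]:
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:m], y_train[:m])
    acc = clf.score(X_test, y_test)   # accuracy on examples never seen in training
    print(f"train size {m:4d}: test accuracy {acc:.2f}")
```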

# Ensemble Learning

• So far: Looked at learning methods in which a single hypothesis, chosen from a hypothesis space, is used to make predictions.

• Ensemble Learning: Select a collection, or ensemble, of hypotheses from the hypothesis space and combine their predictions.

• Motivation: Suppose we have an ensemble of $$K = 5$$ hypotheses and combine their predictions by simple majority voting. For the ensemble to misclassify a new example, at least three of the five hypotheses have to misclassify it, so the chance of misclassification can be much lower than with a single hypothesis.

• Suppose each hypothesis $$h_k$$ in the ensemble has an error of $$p$$ (the probability that a randomly chosen example is misclassified by $$h_k$$ is $$p$$).

• Suppose further that the errors made by the hypotheses are independent; the sketch below works out the resulting ensemble error.
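Under the independence assumption, the ensemble's error is the probability that a majority of the $$K$$ hypotheses err, which a binomial tail sum makes concrete (the value of $$p$$ here is illustrative):

```python
# Majority-vote error check: with K = 5 hypotheses each wrong with
# probability p, the ensemble errs only when 3 or more are wrong.
from math import comb

def ensemble_error(p, K=5):
    need = K // 2 + 1   # majority threshold: at least 3 of 5
    return sum(comb(K, i) * p**i * (1 - p)**(K - i) for i in range(need, K + 1))

print(ensemble_error(0.1))   # ~0.0086, far below the single-hypothesis error of 0.1
```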

## Boosting

• Weighted Training Set

• Each example has an associated weight $$w_j \geq 0$$.

• The higher the weight, the higher the importance attached to it during the learning of a hypothesis.

• Boosting

• Begin with $$w_j = 1$$ for all examples. From this training set, generate the first hypothesis, $$h_1$$.

• Increase the weights for the misclassified examples and decrease the weights for the correctly classified examples to generate $$h_2$$.

• Continue this process until $$K$$ hypotheses are generated (where $$K$$ is an input to the algorithm).

• The final ensemble hypothesis is a weighted-majority combination of all $$K$$ hypotheses, each weighted according to how well it performed on the training set.

• If the input learning algorithm $$L$$ is a weak learning algorithm ($$L$$ always returns a hypothesis with accuracy slightly better than random guessing, i.e., $$50\% + \epsilon$$), then ADABOOST will return a hypothesis that classifies the training data perfectly for large enough $$K$$.
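A hedged Python sketch of this scheme, using decision stumps as the weak learner $$L$$. The stump learner, the data format with labels in $$\{-1, +1\}$$, and the exact weight-update constants are standard AdaBoost choices assumed here, not spelled out in the notes (weights are normalized, so starting at $$w_j = 1$$ or $$1/N$$ is equivalent):

```python
# Sketch of AdaBoost: reweight examples after each weak hypothesis and
# combine the K hypotheses by weighted majority.
import numpy as np

def train_stump(X, y, w):
    """Weak learner: best single-feature threshold under weights w (y in {-1,+1})."""
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = np.sum(w[pred != y])          # weighted training error
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best

def adaboost(X, y, K):
    n = len(y)
    w = np.ones(n) / n                              # equal weights to start
    stumps, alphas = [], []
    for _ in range(K):
        err, j, t, s = train_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)       # better stumps get larger vote
        pred = np.where(X[:, j] <= t, s, -s)
        w *= np.exp(-alpha * y * pred)              # raise misclassified, lower correct
        w /= w.sum()
        stumps.append((j, t, s))
        alphas.append(alpha)

    def h(Xq):                                      # weighted-majority ensemble
        votes = sum(a * np.where(Xq[:, j] <= t, s, -s)
                    for a, (j, t, s) in zip(alphas, stumps))
        return np.sign(votes)
    return h
```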