
1. Learning Decision Trees

1.A.
\[Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-\] \[Entropy(S) = -\frac{35}{50} \log_2 \frac{35}{50} - \frac{15}{50} \log_2 \frac{15}{50} = 0.8813\] \[Gain(S, Holiday) = Entropy(S) - \sum_{v \in Values(Holiday)}\frac{|S_v|}{|S|}\,Entropy(S_v)\] \[Gain(S, Holiday) = Entropy(S) - \left(\frac{15}{50}\,Entropy(S_{Yes\ Holiday}) + \frac{35}{50}\,Entropy(S_{No\ Holiday})\right)\] \[Gain(S, Holiday) = 0.88129 - \left(\frac{15}{50} \cdot 0.9183 + \frac{35}{50} \cdot 0.5917\right) = 0.19163\]
\[Gain(S, Exam\ Tomorrow) = Entropy(S) - \sum_{v \in Values(Exam\ Tomorrow)}\frac{|S_v|}{|S|}\,Entropy(S_v)\] \[Gain(S, Exam\ Tomorrow) = Entropy(S) - \left(\frac{16}{50}\,Entropy(S_{Yes\ Exam}) + \frac{34}{50}\,Entropy(S_{No\ Exam})\right)\] \[Gain(S, Exam\ Tomorrow) = 0.88129 - \left(\frac{16}{50} \cdot 0.9852 + \frac{34}{50} \cdot 0.3534\right) = 0.32573\]
Because the information gain for the Exam Tomorrow attribute is higher than for the Holiday attribute, Exam Tomorrow is the root node of the tree.
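For reference, a minimal Python sketch of these calculations; it reuses the subset entropies computed above rather than recomputing them from the raw counts:
\begin{verbatim}
import math

def entropy(p_pos):
    """Binary entropy: H(p) = -p*log2(p) - (1 - p)*log2(1 - p)."""
    if p_pos in (0.0, 1.0):
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

# Entropy of the full set S (35 vs. 15 out of 50 examples).
H_S = entropy(35 / 50)                                        # ~0.8813

# Information gain, using the weighted subset entropies from above.
gain_holiday = H_S - (15 / 50 * 0.9183 + 35 / 50 * 0.5917)    # ~0.1916
gain_exam = H_S - (16 / 50 * 0.9852 + 34 / 50 * 0.3534)       # ~0.3257

print(H_S, gain_holiday, gain_exam)
\end{verbatim}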

1.B. Redefine information gain using MajorityError as the measure of impurity and use this to represent the data as a decision tree.
For the root node: \[Gain(S, Color) = MajError(S) - \sum_{v \in Values(Color)}\frac{|S_v|}{|S|}\,MajError(S_v)\] \[Gain(S, Color) = \frac{7}{16} - \left(\frac{8}{16}\cdot\frac{3}{8} + \frac{8}{16}\cdot\frac{2}{8}\right) = 0.125\] \[Gain(S, Size) = \frac{7}{16} - \left(\frac{8}{16}\cdot\frac{3}{8} + \frac{8}{16}\cdot\frac{2}{8}\right) = 0.125\] \[Gain(S, Act) = \frac{7}{16} - \left(\frac{8}{16}\cdot\frac{3}{8} + \frac{8}{16}\cdot\frac{2}{8}\right) = 0.125\]
As all attributes have equal information gain, we can use any of them for the root node. I choose Color.
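For illustration, the same gain calculation with MajorityError as the impurity measure can be sketched in Python; the 9-vs-7 class split for the full set is inferred from the 7/16 majority error above:
\begin{verbatim}
def majority_error(counts):
    """MajorityError = fraction of examples outside the majority class."""
    total = sum(counts)
    return (total - max(counts)) / total

# Full set: a 9-vs-7 split over 16 examples gives MajError(S) = 7/16.
me_S = majority_error([9, 7])                                 # 0.4375

# Gain(S, Color): each color covers 8 of 16 examples, with subset
# majority errors of 3/8 and 2/8, as in the calculation above.
gain_color = me_S - (8 / 16 * (3 / 8) + 8 / 16 * (2 / 8))     # 0.125
print(me_S, gain_color)
\end{verbatim}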
Having shown that I understand how to calculate information gain and select the appropriate attribute at each node, I carry on with this procedure until I arrive at the following tree:

My decision tree based on the previous data is:
\begin{verbatim}
if Color = Yellow:
    if Size = Small: Inflated = T
    if Size = Large:
        if Act = Dip: Inflated = F
        if Act = Stretch:
            if Age = Adult: Inflated = T
            if Age = Child: Inflated = F
if Color = Purple:
    if Act = Dip: Inflated = F
    if Act = Stretch:
        if Age = Adult: Inflated = T
        if Age = Child: Inflated = F
\end{verbatim}
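As a sanity check, the same tree can be written as a small Python predicate; the attribute names and dictionary encoding below are illustrative assumptions, not my actual implementation:
\begin{verbatim}
def predict_inflated(example):
    """Classify one example with the decision tree above.
    `example` is assumed to be a dict with keys Color, Size, Act, Age."""
    if example["Color"] == "Yellow":
        if example["Size"] == "Small":
            return True
        # Size = Large
        if example["Act"] == "Dip":
            return False
        return example["Age"] == "Adult"      # Act = Stretch
    # Color = Purple
    if example["Act"] == "Dip":
        return False
    return example["Age"] == "Adult"          # Act = Stretch

print(predict_inflated({"Color": "Yellow", "Size": "Large",
                        "Act": "Stretch", "Age": "Adult"}))   # True
\end{verbatim}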

1.C. [6 points] Does ID3 find a globally optimal decision tree? Justify your answer briefly.
No, ID3 is not guaranteed to find the globally optimal decision tree. When several attributes tie on information gain (as in my calculations for the root of this tree), one must be picked arbitrarily, and ID3 never revisits that greedy choice. An attribute that was not picked could therefore have led to a better tree based on the gains available at the second- and third-level nodes.

2. Decision Trees as Features
2.A. Feature Generation
Even though I ended up not using the WEKA or Java files, I did implement a program that converts the relevant data files into the ARFF data format. This program is included in my “Code” section.
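A minimal sketch of the kind of conversion involved (the relation name, binary feature encoding, and input format here are assumptions for illustration, not my actual converter):
\begin{verbatim}
def write_arff(path, relation, feature_names, rows):
    """Write examples with binary features to a minimal ARFF file.
    `rows` is a list of (feature_values, label) pairs."""
    with open(path, "w") as f:
        f.write(f"@RELATION {relation}\n\n")
        for name in feature_names:
            f.write(f"@ATTRIBUTE {name} {{0,1}}\n")
        f.write("@ATTRIBUTE class {+,-}\n\n@DATA\n")
        for values, label in rows:
            f.write(",".join(str(v) for v in values) + f",{label}\n")

# Hypothetical usage: two binary features and two labeled examples.
write_arff("names.arff", "badges", ["first_a", "last_b"],
           [([1, 0], "+"), ([0, 1], "-")])
\end{verbatim}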

Report
Highest Ranking Algorithm: Stumps as Features
The feature set consists of 100 tree stumps.
Each of these stumps is of depth 4.
In the SGD process, a linear separator is created for weighing the “votes” of each of the 100 stumps.
Parameters: learning rate of 3E-4, error threshold of 1E-3.
2nd Ranking Algorithm: Stumps of depth 4
The feature set consists of the 260 letters (26 potential letters for the first 5 letters of both the first and last names) and 2 lengths (for the first name and last name).
Each stump is of depth 4.
50% of the training set was used for training each stump.

3rd Ranking Algorithm: Unbounded Decision Tree
The feature set consists of the 260 letters and 2 lengths.
Each tree can have unlimited depth.
4th Ranking Algorithm: Stumps of depth 8
The feature set consists of the 260 letters and 2 lengths.
Each stump is of depth 8.
50% of the training set was used for training each stump.

5th Ranking Algorithm: Regular SGD
The feature set consists of the 260 letters and 2 lengths.
The learning rate is 5E-5. The error threshold is 1E-3.

Learning Algorithm       \(P_a\)      t-value vs. Next Ranked    Statistically Better Than Following?
1. Stump Features SGD    0.67002      0.2805                     Not at p < 0.01
2. Stumps of depth 4     0.656458     0.1534                     Not at p < 0.01
3. Decision Tree         0.64644      0.0457                     Not at p < 0.01
4. Stumps of depth 8     0.643022     1.2554                     Not at p < 0.01
5. Simple SGD            0.57844      N/A                        N/A

Important note -> although none of the algorithms is statistically better than the next-ranked one at the 99% level, the comparison from the worst (Simple SGD) to the best (Stump Features SGD) is significant at roughly a 98.5% confidence level.
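The per-fold comparison behind these t-values can be sketched as follows; the accuracy lists shown are hypothetical placeholders, not the actual fold results:
\begin{verbatim}
import math

def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic over per-fold accuracies of two algorithms."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)       # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical 5-fold accuracies for two competing algorithms.
t = paired_t_statistic([0.68, 0.66, 0.70, 0.65, 0.67],
                       [0.58, 0.57, 0.60, 0.56, 0.58])
print(t)   # compare against a t-distribution with n - 1 degrees of freedom
\end{verbatim}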
Conclusions: The stumps-as-features algorithm performed the best because it has the highest \(P_a\) value. I believe the ranking turned out the way it did in part because of overfitting: when trees are very deep, they can overfit the training data at the expense of test-set classification. However, this does not completely coincide with what I found, because the unbounded decision tree was roughly on par with the stumps of depth 8.
In addition, I believe the SGD algorithm with stumps as features was superior because it gets around the local-versus-global optimization problem of a single greedy tree. By using multiple trees, it combines many greedy optimizations into a more “globally” good structure, weighing up the better trees and weighing down the worse ones. This is why I think it did the best; with a larger number of folds, I believe it would have proven best to a statistically significant degree. For SGD, I found that setting the learning rate too high made my error oscillate out of control, which could come from “overshooting” the target weight values when switching from a negative example to a positive one.
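A minimal sketch of the weighting idea described above, assuming each stump outputs a vote in {-1, +1} and a perceptron-style SGD update is used (this may differ from my actual update rule and stopping criterion):
\begin{verbatim}
import random

def sgd_over_stump_votes(stumps, examples, labels,
                         learning_rate=3e-4, epochs=50):
    """Learn one weight per stump so the weighted vote separates the data.
    `stumps` are callables returning +1 or -1; labels are in {-1, +1}."""
    w = [0.0] * len(stumps)
    b = 0.0
    data = list(zip(examples, labels))
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            votes = [stump(x) for stump in stumps]
            score = sum(wi * vi for wi, vi in zip(w, votes)) + b
            if y * score <= 0:    # misclassified: nudge weights toward y
                w = [wi + learning_rate * y * vi
                     for wi, vi in zip(w, votes)]
                b += learning_rate * y
    return w, b
\end{verbatim}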

Tree Displays:
The following cross-validations were the best for when the given Fold was the testing set:
Decision Tree: Fold 2 -> 42 correct, 17 incorrect.
Stumps of depth 8: Fold 3 -> 46 correct, 13 incorrect.
Stumps of depth 4: Fold 3 -> 43 correct, 16 incorrect.
Stumps as Features: Fold 1 -> 45 correct, 14 incorrect.
Important point -> The trees for the unbounded decision tree, the stumps of depth 8, and the stumps of depth 4 are included in the zip file as Word documents.