
Gradient Boosting Variation

Boosting Trees

In boosting methods \cite{schapire-adaBoost}, base estimators are built sequentially rather than independently as in Random Forests, and each new iteration tries to reduce the overall bias of the combined estimator.

The objective function combines two minimization objectives:

Predictive power: how well the algorithm fits the data used for training and, hopefully, the true underlying distribution.

Regularization: favors parsimonious models. All models approximate natural processes only to some degree, so if a simpler model can be used with the same predictive power, then this model should be used.

\[\label{eq:objFunction} Obj(\Theta) = L(\Theta) + R(\Theta)\]
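
As a minimal illustration of Equation \ref{eq:objFunction}, assuming a squared-error loss for \(L\) and an L2 penalty for \(R\) (both are our choices for this sketch, not fixed by the text):

\begin{verbatim}
import numpy as np

def objective(y, y_hat, theta, lam=1.0):
    """Obj(theta) = L(theta) + R(theta), with a squared-error loss
    and an L2 regularizer of strength lam (illustrative choices)."""
    loss = np.sum((y - y_hat) ** 2)       # L(theta): fit to the data
    reg = 0.5 * lam * np.sum(theta ** 2)  # R(theta): favors small parameters
    return loss + reg

# Two candidate parameter vectors giving the same fit:
# the regularizer prefers the more parsimonious one.
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.1, 1.9, 3.2])
print(objective(y, y_hat, theta=np.array([0.5, 0.1])))
print(objective(y, y_hat, theta=np.array([5.0, -4.0])))
\end{verbatim}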

When building an ensemble-of-trees model, a higher-level model is learned on top of weaker decision trees. If \(Tr\) is a set of tree models and \(K\) the number of trees taken from \(Tr\), then the trees will be the parameters for this model and \[y = \sum_{t=1}^{K} \gamma_t h_t(x) , \ h_t \in Tr \ \forall t \in \{1,...,K\}\]

where \(\gamma_t\) denotes the weight of each tree \(h_t\) and \(K\) is a hyper-parameter that represents the number of trees.
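
A sketch of this weighted sum, with scikit-learn decision trees standing in for the weak learners \(h_t\); the data, depths and weights \(\gamma_t\) are illustrative:

\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def ensemble_predict(trees, gammas, X):
    """y = sum_t gamma_t * h_t(x): a weighted sum of weak tree learners."""
    return sum(g * h.predict(X) for g, h in zip(gammas, trees))

# Illustrative data and weak learners (K = 2 shallow trees).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

h1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
h2 = DecisionTreeRegressor(max_depth=3).fit(X, y)
y_pred = ensemble_predict([h1, h2], gammas=[0.5, 0.5], X=X)
\end{verbatim}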

For the moment we absorb the tree’s weight \(\gamma_t\) into the weak learner, yielding a single function \(f_t(\cdot)\). The model therefore becomes

\[y = \sum_{t=1}^{K} f_t(x) , \ f_t \in Tr \ \forall t \in \{1,...,K\}\]

and a single tree is represented in the form

\[f(x) = \theta_{q(x)} = \sum_{j=1}^J \theta_j I(x \in R_j)\]

with \(\theta_j \in \mathbb{R} \ \forall j = 1,...,J\), where \( \coprod_{j=1}^J R_j\) is a partition of the feature space. The function \(q : X \mapsto \{1,...,J\}\) denotes the mapping from samples to regions. In summary, \(\{\theta_j, R_j\}_{j=1}^J\) are the model’s parameters and \(J\) is a hyper-parameter. Note that finding the best partition of the feature space is a non-trivial optimization problem, since finding subset partitions satisfying a global condition is a combinatorial problem.
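
A sketch of a single tree in this notation, with a hand-picked partition of a one-dimensional feature space (the regions \(R_j\) and weights \(\theta_j\) below are arbitrary examples):

\begin{verbatim}
# Illustrative partition of a 1-D feature space into J = 3 regions
# R_1 = (-inf, 1), R_2 = [1, 2), R_3 = [2, inf) with leaf weights theta_j.
theta = [0.5, -1.0, 2.0]

def q(x):
    """q : X -> {1, ..., J}, the mapping from a sample to its region."""
    if x < 1.0:
        return 1
    if x < 2.0:
        return 2
    return 3

def f(x):
    """f(x) = theta_{q(x)}: a tree is a piecewise-constant function."""
    return theta[q(x) - 1]

print([f(x) for x in (0.2, 1.5, 2.7)])   # -> [0.5, -1.0, 2.0]
\end{verbatim}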

In this setting, the objective function must account for every tree in the ensemble, and thus we would have that

\[\label{eq:boositing-objfunction} Obj(\Theta) = \sum_i^n l(y^i,\hat{y}^i) + \sum_t R(f_t)\]

At a higher level, \(\Theta\) is a parameter encoding the lower trees’ parameter information. If \(\Theta_t\) is the parameter associated with each tree model \(f_t\), then \(\Theta = \Theta_0 \cup \bigcup_{t=1}^{K} \Theta_t\), where the parameter \(\Theta_0\) is not associated with any tree and is reserved to characterize the tree assembly. An optimization routine would have to fit all of the parameters in \(\Theta\) collectively to learn this model. This is prohibitively costly in practice, so optimization heuristics are used instead.
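
A sketch of how this objective could be evaluated for a fixed ensemble, assuming a squared-error loss and a per-tree regularizer proportional to the number of leaves (both assumptions, as well as the helper name, are ours):

\begin{verbatim}
import numpy as np

def boosting_objective(y, tree_preds, tree_n_leaves, gamma=1.0):
    """Obj(Theta) = sum_i l(y^i, y_hat^i) + sum_t R(f_t), with
    l = squared error and R(f_t) = gamma * (number of leaves of f_t)
    as illustrative choices."""
    y_hat = np.sum(tree_preds, axis=0)   # additive ensemble prediction
    loss = np.sum((y - y_hat) ** 2)      # sum_i l(y^i, y_hat^i)
    reg = gamma * sum(tree_n_leaves)     # sum_t R(f_t)
    return loss + reg

# Toy usage: two trees' predictions on three samples.
y = np.array([1.0, 0.0, 2.0])
tree_preds = np.array([[0.8, 0.1, 1.5],
                       [0.1, -0.2, 0.4]])
print(boosting_objective(y, tree_preds, tree_n_leaves=[4, 4]))
\end{verbatim}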

Additive Training

A first take on this optimization problem goes along the way of greedy algorithms. One tree is fit at a time and new trees are then successively added in later steps to improve on previous trees’ errors.

Let \(t\) be the step index of the algorithm, where \(t \in \{0,...,K\}\). At step \(t\), let \(Obj_t(\Theta)\) be the objective function and \(\hat{y}_t^i\) the predicted target for sample \(i\). Then the \(i\)-th target’s value at each step iterates in the following way:

\[\label{eq:gb-targetSteps} \begin{gathered} \hat{y}_0^i = 0 \\ \vdots \\ \hat{y}_t^i = \sum_{k=1}^{t} f_k(x^i) = \hat{y}_{t-1}^i + f_t(x^i) \end{gathered}\]
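
A minimal sketch of this additive scheme with scikit-learn decision trees, assuming a squared-error loss so that each new tree can be fit to the current residuals; the data, tree depth and number of steps \(K\) are illustrative:

\begin{verbatim}
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

K = 20
trees = []
y_hat = np.zeros_like(y)                 # y_hat_0 = 0
for t in range(1, K + 1):
    # for the squared loss the negative gradient is (proportional to)
    # the residual, so the new tree is fit to it
    residual = y - y_hat
    f_t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    trees.append(f_t)
    y_hat = y_hat + f_t.predict(X)       # y_hat_t = y_hat_{t-1} + f_t(x)

print("training MSE:", np.mean((y - y_hat) ** 2))
\end{verbatim}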

Each tree \(f_t\) is added in such a way that we are minimizing the objective at step \(t\),

\[\begin{split} Obj_t(\Theta) = & \sum_i^n l(y^i, \hat{y}_{t-1}^i + f_t(x^i) ) + R(f_t) + c(t) \\ \approx &\sum_i^n \left( l(y^i, \hat{y}_{t-1}^i) + g^i f_t(x^i) + \frac{1}{2} h^i f_t(x^i)^2 \right) + R(f_t) + c'(t) \end{split}\]

where \(c(t)\) and \(c'(t)\) are constants for fixed \(t\) and a second order Taylor expansion of the loss around \(\hat{y}_{t-1}^i\) was used.

The terms \(g^i\) and \(h^i\) are the first and second order derivatives of the loss function with respect to the previous prediction, \[g^i = \frac{\partial l(y^i, \hat{y}_{t-1}^i)}{\partial \hat{y}_{t-1}^i}, \qquad h^i = \frac{\partial^2 l(y^i, \hat{y}_{t-1}^i)}{\partial (\hat{y}_{t-1}^i)^2 }.\]
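
For concreteness, a hedged sketch of \(g^i\) and \(h^i\) for two common losses, computed element-wise with NumPy (the function names are ours):

\begin{verbatim}
import numpy as np

def grad_hess_squared(y, y_hat_prev):
    """l = (y - y_hat)^2: g_i = 2*(y_hat - y), h_i = 2."""
    g = 2.0 * (y_hat_prev - y)
    h = np.full_like(y, 2.0)
    return g, h

def grad_hess_logistic(y, y_hat_prev):
    """Binary cross-entropy with y in {0, 1} and y_hat the log-odds:
    g_i = sigmoid(y_hat) - y, h_i = sigmoid(y_hat) * (1 - sigmoid(y_hat))."""
    p = 1.0 / (1.0 + np.exp(-y_hat_prev))
    return p - y, p * (1.0 - p)

# Example usage on a few samples.
y = np.array([1.0, 0.0, 1.0])
y_hat_prev = np.array([0.3, -0.2, 0.8])
g, h = grad_hess_logistic(y, y_hat_prev)
\end{verbatim}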

Then, up to the second order approximation,

\[Obj_t(\Theta) \approx \sum_i^n \left( g^i f_t(x^i) + \frac{1}{2} h^i f_t(x^i)^2 \right) + Obj_{t-1}(\Theta) + R(f_t) - R(f_{t-1})\]

In a greedy optimization approach, the tree \(f_t\) will therefore minimize \(\sum_i^n \left( g^i f_t(x^i) + \frac{1}{2} h^i f_t(x^i)^2 \right) + R(f_t)\) at the \(t\)-th step. This approach assumes that, for all \(i \in \{1,...,n\}\) and \(t \in \{1,...,K\}\), the derivatives \(g^i\) and \(h^i\) exist at \(\hat{y}_{t-1}^i\) and are effectively computable.

In particular, for a regularizer of the form \(R(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T\theta_j^2\), where \(T\) denotes the number of leaves of the tree (the \(J\) used earlier), writing \(f_t(x) = \theta_{q(x)}\) and grouping the samples by leaf, \(T_j = \{i : q(x^i) = j\}\), the objective at step \(t\) can be written, up to constants, as

\[\begin{split} Obj_t(\Theta) \approx & \sum_i^n \left( g^i \theta_{q(x^i)} + \frac{1}{2} h^i \theta_{q(x^i)}^2 \right) + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T\theta_j^2 \\ = & \sum_{j=1}^T\left( \Big(\sum_{i \in T_j} g^i \Big)\theta_{j} + \frac{1}{2} \Big(\sum_{i \in T_j} h^i + \lambda \Big) \theta_{j}^2 \right) + \gamma T, \end{split}\]

so each leaf weight \(\theta_j\) can be optimized independently of the others.
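
Setting the derivative with respect to each \(\theta_j\) to zero yields the per-leaf optimum \(\theta_j^* = -\big(\sum_{i \in T_j} g^i\big) / \big(\sum_{i \in T_j} h^i + \lambda\big)\). A minimal sketch of this computation, with placeholder gradients, Hessians and leaf assignments:

\begin{verbatim}
import numpy as np

def optimal_leaf_weights(g, h, leaf, n_leaves, lam=1.0):
    """Per-leaf Newton-style step: theta_j = -G_j / (H_j + lambda), where
    G_j and H_j sum g_i and h_i over the samples mapped to leaf j."""
    theta = np.zeros(n_leaves)
    for j in range(n_leaves):
        mask = (leaf == j)
        G, H = g[mask].sum(), h[mask].sum()
        theta[j] = -G / (H + lam)
    return theta

# Placeholder values: 6 samples mapped onto T = 2 leaves.
g = np.array([0.5, -1.0, 0.2, 1.5, -0.3, 0.1])
h = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
leaf = np.array([0, 0, 0, 1, 1, 1])     # q(x^i) - 1 for each sample
print(optimal_leaf_weights(g, h, leaf, n_leaves=2, lam=1.0))
\end{verbatim}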

Further details on boosting trees can be found in the scikit-learn documentation and in Hastie, Tibshirani and Friedman.

