
Random Forests: Formulation

Let \(K\) be the number of trees in the ensemble and let \(\Theta_k\) encode the parameters of the \(k\)-th tree. As mentioned before, there are several variants of the model, and each variant determines what the parameters \(\Theta_k\) encode. In this part, we will not specify the type of forest used, since the following proofs apply to all of them.

We denote by \(h(\textbf{x},\Theta_k)\) the corresponding classifier and by \(N\) the number of samples in the training set. The creation of a random forest is an iterative procedure where, at the \(k\)-th step, the parameter \(\Theta_k\) is drawn from the same distribution as, but independently of, the previous parameters \(\Theta_1, \ ..., \ \Theta_{k-1}\).
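As an illustrative sketch (and not the specific formulation analyzed here), the following Python code builds one common variant in which \(\Theta_k\) encodes a bootstrap sample of the training set together with per-split random feature subsets; the function name fit_forest and its parameters are illustrative choices of ours.

\begin{verbatim}
# Sketch of one common random forest variant: each Theta_k is an i.i.d. draw
# consisting of a bootstrap sample plus per-split feature-subset randomness.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, K=100, max_features="sqrt", random_state=0):
    rng = np.random.default_rng(random_state)
    N = X.shape[0]
    trees = []
    for _ in range(K):
        idx = rng.integers(0, N, size=N)          # bootstrap sample (part of Theta_k)
        tree = DecisionTreeClassifier(
            max_features=max_features,            # random feature subset at each split
            random_state=int(rng.integers(0, 2**31 - 1)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees
\end{verbatim}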

Let \(\{h_k(\textbf{x})\}_{k=1}^K\) 1 be a set of classifying trees and let \(I\) denote the indicator function. Define the margin function as

\[\label{eq:rf-marginFun} mg(\textbf{x},\textbf{y}) = \frac{1}{K} \sum_{k=1}^K I(h_k(\textbf{x}) = \textbf{y}) - \max_{j\neq \textbf{y}}\left(\frac{1}{K} \sum_{k=1}^K I(h_k(\textbf{x}) = j) \right)\]

The margin function measures, on average, by how much the votes for the correct class exceed the votes for any other class; a negative margin corresponds to a misclassification under the misclassification (0-1) loss. Here the generalization error is denoted \(PE^*\) and is equal to \[PE^* = P_{\textbf{x}, \textbf{y} }(mg(\textbf{x},\textbf{y}) < 0)\]
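To make these definitions concrete, here is a minimal sketch of the margin \ref{eq:rf-marginFun}; it assumes a list of fitted classifiers trees (for instance the one built above), a labelled sample \((\textbf{x},\textbf{y})\) and the list of class labels, all of which are illustrative names rather than part of the original formulation.

\begin{verbatim}
# Sketch of the margin function mg(x, y); a negative margin means the ensemble
# misclassifies the sample.
import numpy as np

def margin(trees, x, y, classes):
    votes = np.array([t.predict(x.reshape(1, -1))[0] for t in trees])
    p_correct = np.mean(votes == y)               # average vote for the true class
    p_best_other = max(np.mean(votes == j) for j in classes if j != y)
    return p_correct - p_best_other
\end{verbatim}

An empirical estimate of \(PE^*\) is then the fraction of held-out samples whose margin is negative.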

It can be shown that, as the number of trees \(K\) grows, the generalization error under the misclassification loss converges to

\[P_{\textbf{x}, \textbf{y} } \left( P_{\Theta} (h(\textbf{x}, \Theta) = \textbf{y}) - \max_{j \neq \textbf{y}} P_{\Theta} (h(\textbf{x}, \Theta) = j) < 0 \right)\]

almost surely for all sequences of parameters \(\Theta_1, ..., \Theta_k,...\)

Proof

The proof follows from seeing that, given a training set, a tree parameter \(\Theta\) and a class \(j\), we have \[\forall \textbf{x}, \quad P_\Theta(h(\textbf{x},\Theta) = j) \ = \ \lim_{L\to\infty} \frac{1}{L} \sum_{l=1}^L I(h_l(\textbf{x}) = j) \] almost surely.

This is because, by the nature of a tree, the set \(\{\textbf{x} \ / \ h(\textbf{x}, \Theta) = j \}\) is a union of hyper-rectangles partitioning feature space. Given the finite size of the training set, there can be only a finite number of such unions of hyper-rectangles over all possible trees. Let \(S_1, ..., S_M\) be an enumeration of these unions and define \(\phi(\Theta) = m\) if \(\{\textbf{x} \ / \ h(\textbf{x}, \Theta) = j \} = S_m\).

We denote by \(L_m\) the number of times that \(\phi(\Theta_l) = m\) for \(l \in \{1, ..., L\}\), where \(L\) is the total number of trees in the forest.

It is immediate that

\[\label{rf-PEconvergence1} \frac{1}{L} \sum_{l=1}^L I(h(\textbf{x},\Theta_l) = j) \ = \ \frac{1}{L} \sum_{m=1}^M L_m I(\textbf{x} \in S_m)\]

and that, by the law of large numbers, there is almost sure convergence \[\label{rf-PEconvergence2} \frac{L_m}{L} = \frac{1}{L} \sum_{l=1}^L I(\phi(\Theta_l) = m) \xrightarrow[L \to \infty]{} P_{\Theta}(\phi(\Theta)= m).\] Let \(C = \bigcup\limits_{m=1}^{M} C_{m}\), where each \(C_m\) is a set of measure zero containing the sequences for which the convergence above fails for index \(m\). Combining \ref{rf-PEconvergence1} and \ref{rf-PEconvergence2}, we finally have that, outside of \(C\),

\[\frac{1}{L} \sum_{l=1}^L I(h_l(\textbf{x}) = j) \xrightarrow[L \to \infty]{} \sum_{m=1}^M P_{\Theta}(\phi(\Theta)= m) \, I(\textbf{x} \in S_m ) \ = \ P_{\Theta}(h(\textbf{x}, \Theta) = j)\]
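As an empirical illustration of this convergence (a sketch on synthetic data; the dataset, forest settings and held-out estimate of \(PE^*\) are illustrative choices, not part of the result), the held-out error of a forest stabilizes as the number of trees grows:

\begin{verbatim}
# Sketch: the held-out misclassification error stabilizes as K grows.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for K in (1, 10, 50, 200, 1000):
    clf = RandomForestClassifier(n_estimators=K, random_state=0).fit(X_tr, y_tr)
    print(K, 1.0 - clf.score(X_te, y_te))   # estimate of PE* with K trees
\end{verbatim}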

Predictive error bounds

Random Forests are built upon a bag of weak classifiers, each of which has a different prediction error. To build an estimate of the generalization error of the ensemble classifier, these individual scores and the relationship between them must be measured. In this sense, the strength of and the correlation within a Random Forest must be analyzed to arrive at an estimate of the generalization error.

There exists an upper bound for the generalization error.

Define \(\hat{j}(\textbf{x},\textbf{y}) = \arg\max_{j\neq \textbf{y}} P_{\Theta}(h(\textbf{x},\Theta) = j)\) and let the margin function of a random forest (as opposed to that of a fixed set of classifiers, \ref{eq:rf-marginFun}) be defined as

\[\label{eq:rf-marginFunRf} mr(\textbf{x},\textbf{y}) = P_{\Theta}(h(\textbf{x},\Theta) = \textbf{y}) - P_{\Theta}(h(\textbf{x},\Theta) = \hat{j}) = {{\mathbb{E}}}_{\Theta} \left[ I(h(\textbf{x},\Theta ) = \textbf{y} ) - I( h( \textbf{x},\Theta ) = \hat{j} ) \right]\]

Here the margin function is written as the expectation of another function, the raw margin function \(rmg(\Theta,\textbf{x},\textbf{y}) = I(h(\textbf{x},\Theta ) = \textbf{y} ) - I( h( \textbf{x},\Theta ) = \hat{j} )\)\label{eq:rf-rawMarginFun}. Intuitively, for a given \(\Theta\), the raw margin function assigns to a sample the value \(1\) if the tree \(h(\cdot,\Theta)\) predicts the correct class, \(-1\) if it predicts the most voted incorrect class \(\hat{j}\), and \(0\) otherwise.
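A minimal sketch of these two quantities (reusing the illustrative trees and class labels from above; the empirical average over the trees plays the role of the expectation over \(\Theta\)):

\begin{verbatim}
# Sketch: raw margin of a single tree, and mr(x, y) estimated by averaging the
# raw margins over the trees of the forest.
import numpy as np

def raw_margin(tree, x, y, j_hat):
    pred = tree.predict(x.reshape(1, -1))[0]
    return float(pred == y) - float(pred == j_hat)

def forest_margin(trees, x, y, classes):
    votes = np.array([t.predict(x.reshape(1, -1))[0] for t in trees])
    others = [j for j in classes if j != y]
    j_hat = max(others, key=lambda j: np.mean(votes == j))  # most voted wrong class
    return np.mean([raw_margin(t, x, y, j_hat) for t in trees])
\end{verbatim}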

With these definitions, and since \(\Theta\) and \(\Theta'\) denote independent draws from the same distribution, it is straightforward to see that

\[mr( \textbf{x},\textbf{y} )^2 = {{\mathbb{E}}}_{\Theta, \Theta'} \left[ rmg( \Theta,\textbf{x},\textbf{y} ) \ rmg(\Theta',\textbf{x},\textbf{y} ) \right]\].

This in turn implies that

\[\label{eq:rf-marginFunVar} \begin{split} var(mr) & = {{\mathbb{E}}}_{\Theta, \Theta'} \left[ cov_{\textbf{x},\textbf{y}} \left( rmg(\Theta,\textbf{x},\textbf{y} ), rmg(\Theta',\textbf{x},\textbf{y} ) \right) \right] \\ & = {{\mathbb{E}}}_{\Theta, \Theta'} \left[ \rho(\Theta, \Theta')\sigma(\Theta)\sigma(\Theta') \right] \end{split}\]

where \(\rho(\Theta, \Theta')\) is the correlation between \(rmg(\Theta,\textbf{x},\textbf{y})\) and \(rmg(\Theta',\textbf{x},\textbf{y})\), and \(\sigma(\Theta)\) is the standard deviation of \(rmg(\Theta,\textbf{x},\textbf{y})\); in both cases \(\Theta\) and \(\Theta'\) are held fixed and the randomness is over \((\textbf{x},\textbf{y})\). Equation \ref{eq:rf-marginFunVar} in turn implies that

\[\label{eq:rf-varianceBound} \begin{split} var(mr) & = \overline{\rho} ({{\mathbb{E}}}_{\Theta}\left[ \sigma(\Theta)\right] )^2 \\ & \leq \overline{\rho} {{\mathbb{E}}}_{\Theta} \left[ var(\Theta) \right] \end{split}\]

where the inequality follows from \(\left({{\mathbb{E}}}_{\Theta}\left[\sigma(\Theta)\right]\right)^2 \leq {{\mathbb{E}}}_{\Theta}\left[\sigma(\Theta)^2\right] = {{\mathbb{E}}}_{\Theta}\left[var(\Theta)\right]\) (Jensen's inequality), and where we have conveniently defined \(\overline{\rho}\) as

\[\label{eq:rf-meanCorrelation} \frac{{{\mathbb{E}}}_{\Theta, \Theta'} \left[ \rho(\Theta, \Theta') \sigma(\Theta) \sigma(\Theta')\right]} {{{\mathbb{E}}}_{\Theta, \Theta'} \left[ \sigma(\Theta) \sigma(\Theta')\right]}\]

Note that this is a weighted mean of the correlations, with weights \(\sigma(\Theta)\sigma(\Theta')\).

Let the strength of the set of weak classifiers in the forest be defined as

\[\label{eq:rf-strength} s = {{\mathbb{E}}}_{\textbf{x},\textbf{y}} \left[ mr(\textbf{x},\textbf{y} ) \right]\]

Assuming that \(s > 0\), Chebyshev's inequality gives the bound \[\label{eq:rf-predictiveErrorBound1} PE^* \leq var(mr)/s^2\] since \(PE^* = P_{\textbf{x},\textbf{y}}(mr(\textbf{x},\textbf{y}) < 0) \leq P_{\textbf{x},\textbf{y}}(|mr(\textbf{x},\textbf{y}) - s| \geq s) \leq var(mr)/s^2\). On the other hand, we also have that

\[\label{eq:rf-expectedVarBound} \begin{split} {{\mathbb{E}}}_{\Theta} \left[ var(\Theta) \right] & \leq {{\mathbb{E}}}_{\Theta} \left[ {{\mathbb{E}}}_{\textbf{x},\textbf{y}}\left[ rmg(\Theta,\textbf{x},\textbf{y})^2 \right] \right] -s^2 \\ & \leq 1-s^2 \end{split}\]

since \(var(\Theta) = {{\mathbb{E}}}_{\textbf{x},\textbf{y}}\left[ rmg(\Theta,\textbf{x},\textbf{y})^2 \right] - \left({{\mathbb{E}}}_{\textbf{x},\textbf{y}}\left[ rmg(\Theta,\textbf{x},\textbf{y}) \right]\right)^2\), \({{\mathbb{E}}}_{\Theta}\left[\left({{\mathbb{E}}}_{\textbf{x},\textbf{y}}\left[ rmg \right]\right)^2\right] \geq s^2\) by Jensen's inequality, and \(rmg^2 \leq 1\).

We can use \ref{eq:rf-varianceBound}, \ref{eq:rf-predictiveErrorBound1} and \ref{eq:rf-expectedVarBound} to establish the upper bound for the prediction error we are looking for \[\label{eq:rf-PEBound} PE^* \leq \overline{\rho}\frac{(1-s^2)}{s^2}\]
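As a sketch of how this bound can be estimated in practice (assuming a matrix R of raw margins with one row per tree and one column per evaluation sample; the plug-in estimators of \(s\) and \(\overline{\rho}\) below are our own illustrative choices):

\begin{verbatim}
# Sketch: plug-in estimates of the strength s, the weighted mean correlation
# rho_bar of eq. rf-meanCorrelation, and the bound rho_bar * (1 - s^2) / s^2.
# R[k, i] is the raw margin of tree k on evaluation sample i; requires s > 0.
import numpy as np

def pe_bound(R):
    s = R.mean()                                      # strength: mean raw margin
    sd = R.std(axis=1)                                # per-tree std of the raw margin
    C = np.cov(R, bias=True)                          # covariances between trees
    num = np.sum(C) - np.trace(C)                     # sum over pairs of rho*sd*sd'
    den = np.sum(np.outer(sd, sd)) - np.sum(sd ** 2)  # sum over pairs of sd*sd'
    rho_bar = num / den
    return rho_bar * (1.0 - s ** 2) / s ** 2
\end{verbatim}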

This bound on the generalization error shows the importance of both the strength of the individual weak classifiers in the forest and the correlation among them. The author of the algorithm \cite{breiman-randomforests} remarks that this bound is unlikely to be tight. He also places special importance on the ratio of the correlation to the squared strength, \(\frac{\overline{\rho}}{s^2}\), which should be as small as possible in order to build a strong classifier.

Binary Class

In the context of a binary classification problem, where the target variable can take only two values, formula \ref{eq:rf-PEBound} simplifies. Since \(\max_{j \neq \textbf{y}} P_{\Theta}(h(\textbf{x},\Theta) = j) = 1 - P_{\Theta}(h(\textbf{x},\Theta) = \textbf{y})\), the margin function takes the form \(2 P_{\Theta}(h(\textbf{x},\Theta) = \textbf{y}) - 1\) and, similarly, the raw margin function becomes \(2 I(h(\textbf{x}, \Theta) = \textbf{y}) - 1\).

The prediction error bound derived in \ref{eq:rf-predictiveErrorBound1} assumes that \(s > 0\), which in this case amounts to \[{{\mathbb{E}}}_{\textbf{x},\textbf{y}} \left[ P_{\Theta}(h(\textbf{x},\Theta) = \textbf{y}) \right] > \frac{1}{2}\]

Also, since the raw margins are affine transformations of the indicators \(I(h(\textbf{x}, \Theta) = \textbf{y})\) and \(I(h(\textbf{x}, \Theta') = \textbf{y})\), \(\overline{\rho}\) is the mean value of the correlation between these indicators and takes the form

\[\overline{\rho} = {{\mathbb{E}}}_{\Theta,\Theta'} \left[ \rho \left( h(\cdot{},\Theta) ,h(\cdot{},\Theta') \right) \right]\]

Other Notes on Random Forests

One benefit of building Random Forest classifiers is that the algorithm readily improves the prediction error of a group of estimators by randomly building each of them in a way that decreases the variance of the overall model at the cost of a small increase in bias. The model is also robust to the introduction of a number of noisy features. If the ratio of informative to non-informative features is not too extreme, selecting \(m\) features at random at each split means that, in most cases, splits will still be made on informative features. Note that in any given tree, the probability of drawing at least one informative feature for a split remains very high: the number of informative features among the candidates follows a hypergeometric distribution \(\mathcal{H}(P,j,m)\) with \(m\) draws from a total population of \(P\) features of which only \(j\) are informative.
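For example, the probability that a split sees at least one informative feature can be computed directly from this distribution (a sketch; the values of \(P\), \(j\) and \(m\) below are arbitrary illustrations):

\begin{verbatim}
# Sketch: probability that at least one of the m randomly drawn candidate
# features at a split is informative, out of P features of which j are informative.
from scipy.stats import hypergeom

def p_at_least_one_informative(P, j, m):
    return 1.0 - hypergeom.pmf(0, P, j, m)   # 1 - P(no informative feature drawn)

print(p_at_least_one_informative(100, 10, 10))   # approx. 0.67
\end{verbatim}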

The depth to which each tree is grown is another important tuning parameter. We must choose it correctly by assessing the model's performance across different depth values. A very deep tree will tend to overfit the data by partitioning the input space too finely around the training samples. This effect will counter the overall reduction in variance of the forest and thus increase the generalization error of the algorithm.

The algorithm also benefits from a heuristic to measure variable importance, which requires only a small modification to the way the forest is built.

The idea is that at each split we can measure the gain of using a certain variable for the split versus not using it. Given a candidate feature \(X_j\) and every node in a tree where a split is to be made, we compare the improvement in split performance, as measured by some pre-selected criterion, with and without \(X_j\). These results are recorded and averaged across all trees and all split scenarios to obtain a score for the feature. The features with the highest scores can be regarded as the most informative variables of the model.
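One common realization of this idea is the impurity-based importance that accumulates the split-criterion improvement attributed to each feature over all splits and trees; the sketch below uses scikit-learn, whose per-tree feature_importances_ attribute records the normalized impurity decrease per feature, and the dataset is an illustrative stand-in.

\begin{verbatim}
# Sketch: impurity-based variable importance, averaged across the trees of a forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Each tree records the normalized impurity decrease attributed to every feature;
# averaging across trees gives the forest-level score.
scores = np.mean([t.feature_importances_ for t in forest.estimators_], axis=0)
ranking = np.argsort(scores)[::-1]
print("Highest-scoring features:", ranking[:5])
\end{verbatim}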


  1. There is an abuse of notation in writing trees as \(h_k(\textbf{x})\) instead of \(h(\textbf{x}, \Theta_k)\).