STATISTICAL LEARNING THEORY NOTES

    ERM Principle Consistency

    Denote by \(a_l\) a function that minimizes the empirical risk over the sample \(z_1, \ldots, z_l\), so that \[R_{emp}(a_l) \triangleq \inf_{a \in \Lambda}R_{emp}(a)=\inf_{a \in \Lambda}\frac{1}{l}\sum_{i=1}^{l}Q(z_i,a)\]

    Definition. The ERM principle is consistent for the set of functions \(Q(z,a)\), \(a \in \Lambda\), and for the probability distribution \(F(z)\) if the following two sequences converge in probability: \[R(a_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{a \in \Lambda}R(a)\] \[R_{emp}(a_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{a \in \Lambda}R(a)\] where \(R(a) \triangleq \int Q(z,a)dF(z)\) is the expected risk.
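    As a quick sanity check of this definition, the two convergences can be observed numerically. The sketch below is illustrative only; the squared loss \(Q(z,a) = (z-a)^2\), \(z \sim N(0,1)\), and the finite grid standing in for \(\Lambda\) are assumptions of the illustration, not part of these notes. In this setting \(R(a) = a^2 + 1\) and \(\inf_{a}R(a) = 1\).

    ```python
    import numpy as np

    # Minimal numerical sketch of the two convergences in the definition.
    # Assumed setting (illustration only): Q(z, a) = (z - a)^2, z ~ N(0, 1),
    # Lambda = a grid on [-3, 3]; then R(a) = a^2 + 1 and inf_a R(a) = 1.
    rng = np.random.default_rng(0)
    grid = np.linspace(-3.0, 3.0, 601)               # stands in for Lambda
    true_risk = grid**2 + 1.0                        # R(a), closed form

    for l in [10, 100, 1000, 10000]:
        z = rng.normal(size=l)
        emp_risk = ((z[:, None] - grid[None, :]) ** 2).mean(axis=0)  # R_emp(a)
        k = np.argmin(emp_risk)                      # index of the ERM solution a_l
        print(l, true_risk[k] - 1.0,                 # R(a_l)     - inf R  -> 0
              emp_risk[k] - 1.0)                     # R_emp(a_l) - inf R  -> 0
    ```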

    Definition. The ERM principle is strictly consistent for the set of functions \(Q(z,a)\), \(a \in \Lambda\), and for the probability distribution \(F(z)\) if for every \(c \in (-\infty, +\infty)\) such that the subset \[\Lambda(c) \triangleq \bigg \{a: \int Q(z,a) dF(z) \geq c \bigg \}\] is nonempty, the convergence \[\label{eq:strict_consistency} \inf_{a \in \Lambda(c)}R_{emp}(a) \xrightarrow[l \rightarrow \infty]{P} \inf_{a \in \Lambda(c)}R(a)\] is valid.
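    Strict consistency requires the same convergence after restricting to every nonempty \(\Lambda(c)\). Continuing the hypothetical squared-loss sketch above (the level \(c\) and the grid are again assumptions of the illustration):

    ```python
    import numpy as np

    # Sketch of the strict-consistency condition in the same hypothetical
    # squared-loss setting as above: restrict to Lambda(c) = {a : R(a) >= c}
    # and compare infima of the empirical and the true risk over it.
    rng = np.random.default_rng(1)
    grid = np.linspace(-3.0, 3.0, 601)
    true_risk = grid**2 + 1.0

    c = 2.0                                 # any level keeping Lambda(c) nonempty
    mask = true_risk >= c                   # Lambda(c) as a boolean mask
    for l in [10, 100, 1000, 10000]:
        z = rng.normal(size=l)
        emp_risk = ((z[:, None] - grid[None, :]) ** 2).mean(axis=0)
        # inf over Lambda(c) of R_emp should approach inf over Lambda(c) of R:
        print(l, emp_risk[mask].min() - true_risk[mask].min())
    ```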

    Lemma 1.1. If the ERM principle is strictly consistent, then the following convergence in probability holds: \[\label{eq:cons_lemma} R(a_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{a \in \Lambda}R(a)\] Proof. Denote \[\label{eq:T} T \triangleq \inf_{a \in \Lambda}R(a)\] Then, due to (\ref{eq:strict_consistency}), \[\label{eq:cons_prob_1} P\bigg\{ \inf_{a \in \Lambda}R_{emp}(a) - \inf_{a \in \Lambda}R(a) \geq \frac{\epsilon}{2} \bigg\} \xrightarrow[l \rightarrow \infty]{} 0\] and \[\label{eq:cons_prob_0} P\bigg\{\inf_{a \in \Lambda(T+\epsilon)}R_{emp}(a) - \inf_{a \in \Lambda}R(a) \geq \frac{\epsilon}{2} \bigg\} \xrightarrow[l \rightarrow \infty]{} 1\] are valid for any \(\epsilon > 0\) (the latter because \(\inf_{a \in \Lambda(T+\epsilon)}R(a) \geq T+\epsilon\)). Since \(R_{emp}(a_l) = \inf_{a \in \Lambda}R_{emp}(a)\), (\ref{eq:cons_prob_1}) and (\ref{eq:cons_prob_0}) imply that \(a_l \notin \Lambda(T+\epsilon)\) with probability tending to one. Thus, from the definition of the subsets \(\Lambda(c)\), the inequalities \[T \leq \int Q(z,a_l) dF(z) < T+\epsilon\] hold with probability tending to one. Since \(\epsilon > 0\) was arbitrary, this is exactly the convergence (\ref{eq:cons_lemma}). The lemma is proven.

    The convergence in (\ref{eq:strict_consistency}) always implies the convergence in (\ref{eq:cons_lemma}), but not vice versa.
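    The standard counterexample is what Vapnik calls trivial consistency (recalled here only as an illustration; the construction is not developed in these notes). Augment an arbitrary set of functions with a single minorizing function \(\phi(z)\) satisfying \[\phi(z) < \inf_{a \in \Lambda} Q(z,a) \quad \text{for all } z.\] ERM on the augmented set always selects \(\phi\), so the convergence (\ref{eq:cons_lemma}) holds by the law of large numbers no matter how rich the original set is; but for levels \(c > \int \phi(z)dF(z)\) the subset \(\Lambda(c)\) excludes \(\phi\), and the convergence (\ref{eq:strict_consistency}) can fail on the remaining functions.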

    Key Theorem of Statistical Learning Theory

    The relationship \[P\bigg\{\sup_{a\in \Lambda} \bigg |\int Q(z,a)dF(z) - \frac{1}{l}\sum_{i=1}^{l}Q(z_i,a) \bigg | > \epsilon \bigg \} \xrightarrow[l \rightarrow \infty]{} 0,\] for any \(\epsilon > 0\), is called uniform convergence of means to their mathematical expectations over a given set of functions. Similarly, the relationship \[P\bigg\{\sup_{a\in \Lambda} \bigg (\int Q(z,a)dF(z) - \frac{1}{l}\sum_{i=1}^{l}Q(z_i,a) \bigg ) > \epsilon \bigg \} \xrightarrow[l \rightarrow \infty]{} 0,\] for any \(\epsilon > 0\), is called uniform one-sided convergence of means to their mathematical expectations over a given set of functions.
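    A rough numerical illustration of the uniform deviation, in the same hypothetical squared-loss setting as above (a finite grid again stands in for \(\Lambda\); with a richer set of functions the sup could not be scanned this way):

    ```python
    import numpy as np

    # Monte Carlo sketch of the uniform deviation in the hypothetical
    # squared-loss setting: sup_a |R(a) - R_emp(a)| over the grid should
    # shrink as l grows (this small family does converge uniformly).
    rng = np.random.default_rng(3)
    grid = np.linspace(-3.0, 3.0, 601)
    true_risk = grid**2 + 1.0
    for l in [10, 100, 1000, 10000]:
        z = rng.normal(size=l)
        emp_risk = ((z[:, None] - grid[None, :]) ** 2).mean(axis=0)
        print(l, np.abs(true_risk - emp_risk).max())   # sup_a |R - R_emp|
    ```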

    Theorem 1.1 (Key Theorem). Let there exist constants \(\beta\) and \(B\) such that for all functions \(Q(z,a)\), \(a \in \Lambda\), and for a given distribution \(F(z)\) the inequalities \[\beta \leq \int Q(z,a)dF(z) \leq B , \:\: a \in \Lambda\] are valid.

    Then the following two statements are equivalent:

    1. For the given distribution function \(F(z)\), the ERM principle is strictly consistent on the set of functions \(Q(z,a)\), \(a \in \Lambda\).

    2. For the given distribution function \(F(z)\), the uniform one-sided convergence of the means to their mathematical expectations takes place over the set of functions \(Q(z,a)\), \(a \in \Lambda\).

    Proof. Let the ERM principle be strictly consistent. Then, for any nonempty subset \(\Lambda(c)\), the convergence in probability \[\label{eq:cons_in_prob} P \bigg \{ \inf_{a \in \Lambda(c)} \int Q(z,a)dF(z) - \inf_{a \in \Lambda(c)} \frac{1}{l} \sum_{i=1}^{l} Q(z_i,a) > \frac{\epsilon}{2} \bigg \} \xrightarrow[l \rightarrow \infty]{} 0\] is valid for any \(\epsilon > 0\). We split the range of the functions \(Q(z,a)\), \(a \in \Lambda\), using a finite sequence of numbers \(\beta_1, \beta_2, \ldots, \beta_n\) such that \[\label{eq:cons_betas} |\beta_{i+1} - \beta_i | < \frac{\epsilon}{2}, \:\:\: \beta_1=\beta, \:\: \beta_n = B.\] We denote by \(T_k\) the event \[\label{eq:cons_eventTk} \inf_{a \in \Lambda(\beta_k)} \int Q(z,a)dF(z) - \inf_{a \in \Lambda(\beta_k)} \frac{1}{l} \sum_{i=1}^{l} Q(z_i,a) > \frac{\epsilon}{2}\] and by \(T\) the event \(\cup_{k=1}^nT_k\). Then, by virtue of (\ref{eq:cons_in_prob}), \[P(T) = P(\cup_{k=1}^n T_k) \leq \sum_{k=1}^n P(T_k) \xrightarrow[l \rightarrow \infty]{} 0.\]
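    For concreteness, one admissible choice of the finite sequence in (\ref{eq:cons_betas}) (not the only one; any grid with spacing below \(\epsilon/2\) works) is the uniform grid \[\beta_k = \beta + (k-1)\frac{B-\beta}{n-1}, \qquad n = \bigg\lfloor \frac{2(B-\beta)}{\epsilon} \bigg\rfloor + 2,\] since then \(|\beta_{k+1} - \beta_k| = (B-\beta)/(n-1) < \epsilon/2\).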

    To show that strict consistency of the ERM principle implies uniform one-sided convergence of means to their mathematical expectations, we must show that the probability of the event \[A \triangleq \sup_{a \in \Lambda} \bigg(\int Q(z,a)dF(z) - \frac{1}{l}\sum_{i=1}^{l}Q(z_i,a) \bigg) > \epsilon\] converges to zero. Suppose that event \(A\) takes place. Then there exists \(a^* \in \Lambda\) such that \[\label{eq:cons_AtakesPlace} \int Q(z,a^*)dF(z) - \frac{1}{l}\sum_{i=1}^{l}Q(z_i,a^*) > \epsilon.\] Due to (\ref{eq:cons_betas}) we can find \(\beta_k\) such that \(a^* \in \Lambda(\beta_k)\) and \[\int Q(z,a^*)dF(z) - \beta_k < \frac{\epsilon}{2}.\] For the chosen set \(\Lambda(\beta_k)\), since \(\inf_{a \in \Lambda(\beta_k)} \int Q(z,a)dF(z) \geq \beta_k\), the inequality \[\label{eq:cons_eventA} \int Q(z,a^*)dF(z) - \inf_{a \in \Lambda(\beta_k)} \int Q(z,a)dF(z) < \frac{\epsilon}{2}\] is valid.

    Combining (\ref{eq:cons_AtakesPlace}) and (\ref{eq:cons_eventA}), the inequalities \[\label{eq:cons_AleqTk} \begin{split} \inf_{a \in \Lambda(\beta_k)} \int Q(z,a)dF(z) + \frac{\epsilon}{2} & > \int Q(z,a^*)dF(z) \\ > \frac{1}{l}\sum_{i=1}^{l}Q(z_i, a^*) + \epsilon & \geq \inf_{a \in \Lambda(\beta_k)} \frac{1}{l} \sum_{i=1}^{l} Q(z_i, a) + \epsilon \end{split}\] hold and imply that the event \(T_k\) also takes place. Hence \(A \subseteq \cup_{k=1}^n T_k = T\) and \[P(A) \leq P(T) \xrightarrow[l \rightarrow \infty]{} 0 .\] The first part of the theorem is proven.

    To prove the second part of the theorem, we have to show that whenever uniform one-sided convergence of means to their mathematical expectations takes place, the ERM principle is strictly consistent; that is, that for any \(\epsilon > 0\) and any nonempty \(\Lambda(c)\) the convergence \[P\bigg \{ \bigg | \inf_{a \in \Lambda(c)} \int Q(z,a)dF(z) - \inf_{a \in \Lambda(c)} \frac{1}{l} \sum_{i=1}^{l}Q(z_i,a)\bigg | > \epsilon \bigg \} \xrightarrow[l \rightarrow \infty]{} 0\] holds. The event \[A \triangleq \bigg | \inf_{a \in \Lambda(c)} \int Q(z,a)dF(z) - \inf_{a \in \Lambda(c)} \frac{1}{l} \sum_{i=1}^{l}Q(z_i,a)\bigg | > \epsilon\] is the union of the events \[A_1 \triangleq \inf_{a \in \Lambda(c)} \int Q(z,a)dF(z) + \epsilon < \inf_{a \in \Lambda(c)} \frac{1}{l} \sum_{i=1}^{l}Q(z_i,a)\] and \[A_2 \triangleq \inf_{a \in \Lambda(c)} \int Q(z,a)dF(z) - \epsilon > \inf_{a \in \Lambda(c)} \frac{1}{l} \sum_{i=1}^{l}Q(z_i,a).\]

    Our goal is to show that the probability of the event \(A\) converges to zero; towards this, we bound the probabilities of the events \(A_1\) and \(A_2\) separately. Consider that event \(A_1\) takes place. We fix \(a^*\) such that \[\int Q(z,a^*)dF(z) + \frac{\epsilon}{2} < \inf_{a \in \Lambda(c)} \int Q(z,a)dF(z) + \epsilon\] (such an \(a^*\) exists by the definition of the infimum). Then the following inequalities \[\begin{split} \int Q(z,a^*)dF(z) + \frac{\epsilon}{2} & < \inf_{a \in \Lambda(c)} \int Q(z,a)dF(z) + \epsilon \\ < \inf_{a \in \Lambda(c)} \frac{1}{l}\sum_{i=1}^{l}Q(z_i, a) & \leq \frac{1}{l}\sum_{i=1}^{l}Q(z_i, a^*) \end{split}\] hold. Therefore, \[A_1 \subseteq \bigg \{ z: \frac{1}{l}\sum_{i=1}^{l}Q(z_i, a^*) - \int Q(z,a^*)dF(z) > \frac{\epsilon}{2} \bigg \}\] and \[P(A_1) \leq P \bigg \{ \frac{1}{l}\sum_{i=1}^{l}Q(z_i, a^*) - \int Q(z,a^*)dF(z) > \frac{\epsilon}{2} \bigg \} \xrightarrow[l \rightarrow \infty]{} 0.\] The convergence on the right-hand side is valid by the law of large numbers (\(a^*\) is fixed). Note that bounding \(P(A_1)\) did not use uniform one-sided convergence at all; the law of large numbers suffices.
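    This asymmetry is the heart of the theorem: \(P(A_1)\) is controlled by the plain law of large numbers, while \(P(A_2)\), bounded next, genuinely needs uniformity. The sketch below is an illustrative construction (not from these notes) of why the law of large numbers alone cannot bound the sup: a class of indicator losses that can memorize the sample keeps the one-sided sup deviation at 1, even though the law of large numbers holds for each fixed function.

    ```python
    import numpy as np

    # Hypothetical construction (illustration only): z ~ U[0,1] and
    # indicator losses Q(z, a_S) = 1{z not in S} for sets S in [0,1].
    rng = np.random.default_rng(2)
    for l in [10, 100, 1000, 10000]:
        z = rng.uniform(size=l)
        # A FIXED function, S = [0, 1/2]: true risk 1/2, and the empirical
        # risk converges to it by the law of large numbers.
        emp_fixed = (z > 0.5).mean()
        # A data-dependent choice, S = the observed sample: S has Lebesgue
        # measure zero, so its true risk is 1, yet its empirical risk is 0.
        emp_memorized = (~np.isin(z, z)).mean()   # always 0.0
        print(l, abs(0.5 - emp_fixed),            # LLN deviation -> 0
              1.0 - emp_memorized)                # sup one-sided deviation = 1
    ```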

    Now consider that event \(A_2\) takes place. We can find \(a^{**}\) such that the inequality \[\frac{1}{l} \sum_{i=1}^{l}Q(z_i, a^{**}) < \inf_{a \in \Lambda(c)} \frac{1}{l} \sum_{i=1}^{l}Q(z_i, a) + \frac{\epsilon}{2}\] is valid. Then the following inequalities \[\begin{split} \frac{1}{l} \sum_{i=1}^{l}Q(z_i, a^{**}) + \frac{\epsilon}{2} & < \inf_{a \in \Lambda(c)} \frac{1}{l} \sum_{i=1}^{l} Q(z_i, a) +\epsilon \\ < \inf_{a \in \Lambda(c)} \int Q(z,a) dF(z) & \leq \int Q(z, a^{**})dF(z) \end{split}\] hold. Therefore, \[\begin{split} P(A_2) & \leq P\bigg\{ \int Q(z, a^{**})dF(z) - \frac{1}{l} \sum_{i=1}^{l}Q(z_i, a^{**}) > \frac{\epsilon}{2} \bigg\} \\ & \leq P\bigg\{\sup_{a \in \Lambda} \bigg ( \int Q(z,a)dF(z) - \frac{1}{l} \sum_{i=1}^{l}Q(z_i,a) \bigg ) > \frac{\epsilon}{2} \bigg\} \xrightarrow[l \rightarrow \infty]{} 0 \end{split}\] due to the uniform one-sided convergence of means to their mathematical expectations. Finally, since \[A = A_1 \cup A_2 \Longrightarrow P(A) \leq P(A_1) + P(A_2),\] we have that \[P(A) \xrightarrow[l \rightarrow \infty]{} 0.\] The theorem is proven.