- When a random variable \(X\) can be written as a finite linear combination \(X = a_1 1_{E_1} + \cdots + a_n 1_{E_n}\) of indicator random variables with mutually disjoint events \(E_1, \dots, E_n\), its power \(X^r\) has the particularly simple form \[X^r = a_1^r 1_{E_1} + \cdots + a_n^r 1_{E_n}.\]Hence the \(r\)-th moment is \[\mathbf{E}X^r = a_1^r p_1 + \cdots + a_n^r p_n,\]where \(p_i\) is the probability of the event \(E_i\).
- The aesthetically pleasing identity \[(1 - p)^2 p + p^2(1-p) = p(1 - p),\]which gives the variance of a Bernoulli variable in the next item.
- If \(X\) is a Bernoulli random variable with probability \(p\), then \(X = 1_{X = 1}\) and the mean is the probability itself: \(\mathbf{E}X = p\). Further, \(|X - p|^2 = (1-p)^2 1_{X = 1} + p^2 1_{X = 0}\), and so the variance is \(\mathbf{E}|X - p|^2 = p(1-p)\).
- Let \(X_1, X_2, \dots\) be iid Bernoulli variables with probability \(p\), and write \(S_n := X_1 + \cdots + X_n\). The empirical mean \(S_n/n\) is a random variable taking the values \(\frac{i}{n}\), \(i=0, \dots, n\), with probability \[\mathbf{P}(\frac{X_1 + \cdots + X_n}{n} = \frac{i}{n}) = \binom{n}{i}p^i(1-p)^{n-i};\]for instance \(\mathbf{P}(X_1 = 1 \wedge \cdots \wedge X_i = 1 \wedge X_{i+1} = 0 \wedge \cdots \wedge X_n = 0) = p^i(1-p)^{n-i}\), and similarly for permutations. (A short numerical check of these items appears after this list.)
- Monotonicity of probability: the inclusion of events \(E \subset F\) implies the inequality \(\mathbf{P}(E) \leq \mathbf{P}(F)\).
- Given a random variable \(X\) and two disjoint events \(E, F\) with \(\mathbf{P}(E \vee F) = 1\), if \(|X| \leq M_1\) on \(E\) and \(|X| \leq M_2\) on \(F\), then one can bound the mean \(\mathbf{E}|X|\) by \[\mathbf{E}|X| \leq p M_1 + (1-p) M_2,\] where \(p\) and \(1-p\) are respectively the probabilities of \(E\) and \(F\). This easily generalizes to an arbitrary finite number of mutually disjoint events.
- A sequence \(f_1, f_2, \dots\) of functions is uniformly convergent if it converges with respect to the uniform (sup norm) metric on the space of functions; note the difference with pointwise convergence, which only requires the sequence of values \(f_1(p), f_2(p), \dots\) to converge at each individual point \(p\).
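As a quick sanity check on the items above, here is a minimal Python sketch (the helper `moment_simple` and all parameters are my own choices, not anything from the text) that computes the \(r\)-th moment of a simple random variable, recovers the Bernoulli variance \(p(1-p)\), and verifies that the binomial probabilities of the empirical mean sum to one.

```python
from math import comb

# r-th moment of a simple random variable X = sum_i a_i 1_{E_i}
# with mutually disjoint events of probabilities p_i:  E[X^r] = sum_i a_i^r p_i.
def moment_simple(values, probs, r):
    return sum(a**r * p for a, p in zip(values, probs))

p = 0.3
# Bernoulli X: value 1 with probability p, value 0 with probability 1 - p.
mean = moment_simple([1, 0], [p, 1 - p], 1)       # should equal p
# X - p takes value 1-p on {X=1} and -p on {X=0}, so E|X - p|^2 = p(1-p).
var = moment_simple([1 - p, -p], [p, 1 - p], 2)
assert abs(mean - p) < 1e-12
assert abs(var - p * (1 - p)) < 1e-12

# Distribution of the empirical mean S_n/n of n iid Bernoulli(p) variables:
# P(S_n/n = i/n) = C(n, i) p^i (1-p)^(n-i), and these probabilities sum to 1.
n = 10
binomial = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
assert abs(sum(binomial) - 1) < 1e-12
```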
Proposition 5 (Approximation by Bernstein polynomials) Let \(f: [0, 1] \rightarrow \mathbf{R}\) be a continuous function. Then the Bernstein polynomials \[f_n(t) := \sum_{i=0}^{n} \binom{n}{i}t^i(1-t)^{n-i}f(\frac{i}{n})\]converge uniformly to \(f\) as \(n \rightarrow \infty\). In particular, continuous functions on (say) the unit interval \([0, 1]\) can be uniformly approximated by polynomials.
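Before turning to the proof, here is a small Python sketch illustrating the proposition numerically; the function `bernstein` and the test function are my own choices, and the snippet is only meant to show the sup-norm error shrinking as \(n\) grows.

```python
from math import comb

def bernstein(f, n, t):
    """Evaluate the n-th Bernstein polynomial of f at t in [0, 1]."""
    return sum(comb(n, i) * t**i * (1 - t)**(n - i) * f(i / n)
               for i in range(n + 1))

# A sample continuous (but not smooth) function on [0, 1]; any continuous f would do.
f = lambda x: abs(x - 0.5)

grid = [j / 200 for j in range(201)]
for n in (10, 100, 400):
    sup_error = max(abs(bernstein(f, n, t) - f(t)) for t in grid)
    print(n, sup_error)   # the sup-norm error decreases as n increases
```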
Proof: We first establish the pointwise convergence \(f_n(p) \rightarrow f(p)\) for \(0 \leq p \leq 1\). Fix \(p \in [0, 1]\) and let \(X_1, X_2, \dots\) be iid Bernoulli variables with probability \(p\). The empirical mean \(S_n/n\) takes values in the finite set \(\left\{0, 1/n, \dots, n/n \right\}\) (note that \(f(S_n/n)\) is thus dominated) with probability \(\binom{n}{i} p^i (1-p)^{n-i}\) for each \(i/n\), and so from the definition of the Bernstein polynomials \(f_n\) we see that
\[\mathbf{E}f(\frac{X_1 + \cdots + X_n}{n}) = f_n(p).\]The mean of the \(X_i\) is \(p\) and the variance is finite, so by the weak law of large numbers for random variables of finite second moment, the empirical mean \(S_n/n\) converges to \(p\) in probability; since \(f\) is continuous, \(f(S_n/n)\) then converges to \(f(p)\) in probability. By the dominated convergence theorem in probability, we conclude that \(\mathbf{E}f(S_n/n)\) converges to \(f(p)\), which is the claimed pointwise convergence.
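As a quick numerical check of the identity \(\mathbf{E}f(S_n/n) = f_n(p)\), the following sketch (with `bernstein` redefined from the previous snippet and parameters chosen arbitrarily) compares a Monte Carlo estimate of the left-hand side with the exact right-hand side.

```python
import random
from math import comb

def bernstein(f, n, t):
    return sum(comb(n, i) * t**i * (1 - t)**(n - i) * f(i / n)
               for i in range(n + 1))

f = lambda x: x * (1 - x)
p, n, trials = 0.3, 50, 100_000

# Monte Carlo estimate of E f(S_n/n), with S_n a sum of n Bernoulli(p) draws.
total = 0.0
for _ in range(trials):
    s = sum(1 for _ in range(n) if random.random() < p)
    total += f(s / n)
estimate = total / trials

print(estimate, bernstein(f, n, p))  # the two numbers should nearly agree
```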
To establish uniform convergence, we use the proof of the weak law of large numbers rather than the statement of that law, in order to get the desired uniformity in the parameter \(p\). For a given \(p\), we see from the variant of the Chebyshev inequality that \[\mathbf{P}(|\frac{S_n}{n} - p| \geq \delta) \leq \frac{1}{n}\frac{p(1-p)}{\delta^2}\]for any \(\delta > 0\). By hypothesis, \(f\) is continuous on the compact interval \([0, 1]\) and hence uniformly continuous, so for any \(\varepsilon > 0\) there exists a \(\delta > 0\) such that \(|f(x) - f(y)| < \varepsilon\) whenever \(x, y \in [0, 1]\) with \(|x - y| < \delta\). For such an \(\varepsilon\) and \(\delta\), we conclude (by the monotonicity of probability) that\[\mathbf{P}(|f(\frac{S_n}{n}) - f(p)| \geq \varepsilon) \leq \frac{1}{n}\frac{p(1-p)}{\delta^2}.\]Also by hypothesis, \(f\) is bounded in magnitude by some bound \(M\), so that \(|f(S_n/n) - f(p)| \leq 2M\) everywhere. Splitting the expectation over the two events \(|f(S_n/n) - f(p)| < \varepsilon\) and \(|f(S_n/n) - f(p)| \geq \varepsilon\) (as in the earlier bullet on bounding means over disjoint events), we obtain the upper bound \[\mathbf{E}|f(\frac{S_n}{n}) - f(p)| \leq \varepsilon + \frac{2M}{n}\frac{p(1-p)}{\delta^2},\]and thus by the triangle inequality and the identity \(|\mathbf{E}(f(S_n/n) - f(p))| = |f_n(p) - f(p)|\),\[|f_n(p) - f(p)| \leq \varepsilon + \frac{2M}{n}\frac{p(1-p)}{\delta^2}.\]
Since \(p(1-p)\) is bounded uniformly in \(p\) (by \(1/4\)) and \(\varepsilon\) can be made arbitrarily small, we conclude that the \(f_n\) converge uniformly to \(f\), as required.
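The Chebyshev-type tail bound that drives the proof can also be checked empirically; the sketch below (an illustration with parameters of my own choosing) estimates \(\mathbf{P}(|S_n/n - p| \geq \delta)\) by simulation and compares it with the bound \(\frac{1}{n}\frac{p(1-p)}{\delta^2}\).

```python
import random

p, n, delta, trials = 0.3, 200, 0.05, 20_000

# Estimate P(|S_n/n - p| >= delta) by simulating n Bernoulli(p) draws per trial.
hits = 0
for _ in range(trials):
    s = sum(1 for _ in range(n) if random.random() < p)
    if abs(s / n - p) >= delta:
        hits += 1

empirical = hits / trials
chebyshev = p * (1 - p) / (n * delta**2)
print(empirical, chebyshev)   # the empirical tail should not exceed the bound
```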
The first and second moment methods are very general, and apply to sums \(S_n = X_1 + \cdots + X_n\) of random variables \(X_1, \dots, X_n\) that need not be identically distributed, or even independent (although the bounds can get weaker and more complicated if one deviates too far from these hypotheses). For instance, it is clear from linearity of expectation that \(S_n\) has mean \[\mathbf{E}S_n = \mathbf{E}X_1 + \cdots + \mathbf{E}X_n\]
(assuming of course that \(X_1, \dots, X_n\) are absolutely integrable) and variance \[\mathbf{Var}(S_n) = \mathbf{Var}(X_1) + \cdots +\mathbf{Var}(X_n) + \sum_{1\leq i,j \leq n;\ i\ne j} \mathrm{Cov}(X_i, X_j)\](assuming now that \(X_1, \dots, X_n\) are square-integrable). If \(X_1, \dots, X_n\) are pairwise independent in addition to being square-integrable, then all the covariances vanish and we obtain additivity of the variance:\[\mathbf{Var}(S_n) = \mathbf{Var}(X_1) + \cdots + \mathbf{Var}(X_n).\]
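As a numerical illustration of this additivity for independent summands, the following sketch (with distributions and parameters chosen arbitrarily) compares the sample variance of a sum of three independent variables with the sum of their individual variances.

```python
import random
import statistics

random.seed(0)
trials = 100_000

# Three independent (hence pairwise independent) summands with different laws.
def draw_sum():
    x1 = random.choice([-1, 1])          # variance 1
    x2 = random.uniform(0, 1)            # variance 1/12
    x3 = random.gauss(2, 3)              # variance 9
    return x1 + x2 + x3

samples = [draw_sum() for _ in range(trials)]
print(statistics.variance(samples))      # sample variance of the sum
print(1 + 1/12 + 9)                      # Var(X1) + Var(X2) + Var(X3)
```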
Remark
- Viewing the variance as the square of the standard deviation, the above additive identity for the variance can be seen as a rigorous demonstration of the following heuristic principle of square root cancellation: if one has a sum \(S_n = X_1 + \cdots + X_n\) of random (or pseudorandom) variables that "oscillate" (in the sense that their mean is either zero or close to zero), each \(X_i\) has an expected magnitude of about \(a_i\) (in the sense that a statistic such as the standard deviation of \(X_i\) is comparable to \(a_i\)), and the \(X_i\) "behave independently", then the sum \(S_n\) is expected to have a magnitude of about \((\sum_{i=1}^n |a_i|^2)^{1/2}\). For instance, a sum of \(n\) unbiased signs \(X_i \in \left\{ -1, +1\right\}\) would be expected to have magnitude about \(\sqrt n\), unless the \(X_i\) exhibit strong mutual correlations (see the small simulation below). This principle turns out to be applicable to a remarkably broad range of situations, even ones in which no randomness is evidently present (e.g. the type of exponential sums that occur in analytic number theory). We will see some further instantiations of this principle in later notes.
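Here is a small simulation of the square root cancellation heuristic for unbiased signs (trial counts and values of \(n\) are my own choices): the root-mean-square size of \(S_n\) tracks \(\sqrt n\).

```python
import random
from math import sqrt

random.seed(0)

def rms_of_sign_sum(n, trials=4_000):
    """Root-mean-square of S_n = X_1 + ... + X_n for n unbiased random signs X_i."""
    total = 0.0
    for _ in range(trials):
        s = sum(random.choice((-1, 1)) for _ in range(n))
        total += s * s
    return sqrt(total / trials)

for n in (100, 400, 1600):
    print(n, rms_of_sign_sum(n), sqrt(n))   # the last two columns should roughly match
```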