- Suppose that \(\sum_n \mathbf{P}(|X_n - X| > \varepsilon) < \infty\) for all \(\varepsilon > 0\). Then \(X_n \rightarrow X\) almost surely. In other words, a sequence of random variables whose tail probabilities \(\mathbf{P}(|X_n - X| > \varepsilon)\) are summable for every threshold \(\varepsilon > 0\) converges almost surely. This criterion is convenient for upgrading summable bounds on convergence in probability to almost sure convergence (a numerical sketch follows this list). Give a counterexample to show that the converse fails.
- This gives a (strictly) partial converse to the first item: if \(X_n \rightarrow X\) in probability, then one can find a subsequence \(X_{n_j}\) such that \(X_{n_j} \rightarrow X\) almost surely.
- If \(\mathbf{E}|X_n|, \mathbf{E}|X| < \infty\) and \(\mathbf{E}|X_n - X| \rightarrow 0\), then \(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\) for every \(\varepsilon > 0\); thus, if absolutely integrable variables \(X_n\) converge in expectation to an absolutely integrable variable \(X\), then \(X_n\) converges to \(X\) in probability. Give a counterexample to show that convergence in expectation is strictly stronger.
- (Urysohn subsequence principle) Suppose that every subsequence \(X_{n_j}\) of \(X_n\) has a further subsequence \(X_{n_{j_k}}\) that converges to \(X\) in probability. This is a necessary and sufficient condition for the entire sequence \(X_n\) to converge to \(X\) in probability.
- Does the Urysohn subsequence principle still hold if one replaces "in probability" throughout with "almost surely"?
- A continuous function preserves convergence in probability: if \(F: \mathbf{R} \rightarrow \mathbf{R}\) or \(F: \mathbf{C} \rightarrow \mathbf{C}\) is continuous and \(X_n \rightarrow X\) in probability, then \(F(X_n) \rightarrow F(X)\) in probability as well. This holds more generally for functions of several variables \(F: \mathbf{R}^k \rightarrow \mathbf{R}\) or \(F: \mathbf{C}^k \rightarrow \mathbf{C}\); for instance, if \(X_n \rightarrow X\) and \(Y_n \rightarrow Y\) in probability, then \(X_n + Y_n \rightarrow X + Y\) and \(X_n Y_n \rightarrow XY\) in probability as well.
- (Fatou's lemma for convergence in probability) If the \(X_n\) are unsigned and \(X_n \rightarrow X\) in probability (that is, \(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\) as \(n \rightarrow \infty\) for every \(\varepsilon > 0\)), then \(\mathbf{E}X \leq \liminf_{n \rightarrow \infty} \mathbf{E}X_n\); in other words, the limit inferior \(\liminf_{n \rightarrow \infty} \mathbf{E}X_n\) bounds the expectation \(\mathbf{E}X\) from above.
- (Dominated convergence in probability) If \(X_n \rightarrow X\) in probability, and almost surely one has \(|X_n| \leq Y\) for all \(n\) and some random variable \(Y\) with finite expectation, then \(\mathbf{E}X_n \rightarrow \mathbf{E}X\) as \(n \rightarrow \infty\).
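As a quick illustration of the summability criterion in the first item, here is a minimal Monte Carlo sketch, assuming NumPy; the sequence \(X_n = X + Z_n/n\) with \(Z_n\) standard normal is a hypothetical example chosen only for illustration.

```python
import numpy as np

# Monte Carlo sketch of the summability criterion: for the hypothetical
# sequence X_n = X + Z_n / n (Z_n standard normal), the tail probabilities
# P(|X_n - X| > eps) are summable in n, so X_n -> X almost surely.
rng = np.random.default_rng(0)
trials, eps = 100_000, 0.1

tail_probs = []
for n in range(1, 51):
    Z = rng.standard_normal(trials)
    deviation = np.abs(Z / n)                 # |X_n - X| on each sample path
    tail_probs.append(np.mean(deviation > eps))

print("estimated P(|X_n - X| > 0.1) at n = 1, 10, 50:",
      tail_probs[0], tail_probs[9], tail_probs[49])
print("partial sum of the tail probabilities:", sum(tail_probs))
```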
Exercise 3
- (Independence of variables and limiting values in probability) Suppose we have a sequence \(X_1, X_2, \cdots\) of scalar random variables and another random variable \(X\) with \(X_n \rightarrow X\) in probability. Then any random variable \(Y\) that is independent of all the \(X_n\) is also independent of the limit \(X\).
- (Jointly independent variables have almost surely constant limits) If the \(X_n\) are jointly independent, then the limit \(X\) is almost surely constant.
We are now able to state the weak and strong laws of large numbers, in the specific case of independent and identically distributed (iid for short) random variables.
Recall that:
- An iid sequence \(X_1, X_2, \dots\) of random variables is a sequence such that the \(X_n\) are jointly independent and each \(X_n\) has the same distribution, so that \(\mathbf{P}(X_n \in S) = \mathbf{P}(X_m \in S)\) for all \(n, m\) and all measurable \(S\).
- A jointly independent sequence of random variables \(X_1, X_2, \cdots\) is one whose joint distribution is the product of the individual distributions of the \(X_n\); equivalently, \(\mathbf{P}((X_{i_1} \in S_1) \wedge \cdots \wedge (X_{i_n} \in S_n)) = \mathbf{P}(X_{i_1} \in S_1) \cdots \mathbf{P}(X_{i_n} \in S_n)\) for all finite collections of distinct indices \(i_1, \dots, i_n\) and all measurable sets \(S_1, \dots, S_n\).
Theorem 4 (Law of large numbers, model case) Let \(X_1, X_2, \dots\) be an iid sequence of copies of a random variable \(X\) with finite expectation.
- (Weak law of large numbers) Then, \((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) in probability.
- (Strong law of large numbers) Then, \((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) almost surely.
Remarks:
- The law of large numbers asserts, roughly speaking, that the theoretical expectation \(\mathbf{E}X\) of a random variable \(X\) can be approximated by taking a large number of independent samples \(X_1, \dots, X_n\) of \(X\) and then forming the empirical mean \[\frac{X_1+\cdots+X_n}{n}.\] This ability to approximate the theoretical statistics of a probability distribution by empirical data is one of the foundational starting points for mathematical statistics (which is not the focus of this course).
- The tendency of statistics to cluster closely around their mean value is the simplest instance of the concentration of measure phenomenon, which is of great significance not only within probability, but also in applications of probability such as statistics, theoretical computer science, combinatorics, random matrix theory, and high dimensional geometry. (See this blog post for further discussion.)
- It is instructive to compare the law of large numbers with what one can obtain from the Kolmogorov zero-one law: if the \(X_n\) are real-valued, then the limit superior \(\limsup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n\) and limit inferior \(\liminf_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n\) are tail random variables, in the sense that they remain unaffected if one modifies finitely many of the \(X_n\); in particular, events such as \(\limsup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n > t\) are tail events for any \(t \in \mathbf{R}\). From this fact and the zero-one law we conclude that there exist deterministic constants \(-\infty \le \mu_- \le \mu_+ \le \infty\) such that \(\liminf_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n = \mu_-\) and \(\limsup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n = \mu_+\) almost surely. One can then view the strong law of large numbers as the assertion that \(\mu_- = \mu_+ = \mathbf{E}X\) when \(X\) is absolutely integrable.
- Informally, if \(X_1, X_2, \dots\) are iid with mean \(\mu\), then \(X_1+\cdots+X_n \approx \mu n\) for large \(n\), as the numerical sketch below illustrates.
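Here is a minimal numerical sketch of this behaviour (not part of any proof), assuming NumPy; the exponential distribution with mean \(1\) is a hypothetical choice made only for illustration.

```python
import numpy as np

# Empirical means of iid Exp(1) samples (true mean E X = 1) drift toward 1
# as n grows, illustrating the law of large numbers.
rng = np.random.default_rng(1)
samples = rng.exponential(scale=1.0, size=1_000_000)           # X_1, ..., X_N
running_means = np.cumsum(samples) / np.arange(1, samples.size + 1)

for n in (10, 1_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: empirical mean ≈ {running_means[n - 1]:.4f}")
```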
There are several variants of the law of large numbers, for instance when one drops the hypothesis of identical distribution, when the random variable \(X\) is not absolutely integrable, or when one seeks more quantitative bounds on the rate of convergence; we will discuss some of these variants later in these notes.
There are several ways to prove the law of large numbers. One main strategy is the moment method, which controls statistics such as the empirical mean \((X_1 + \cdots + X_n)/n\) by computing moments such as the mean \(\mathbf{E}(X_1 + \cdots + X_n)/n\), the variance \(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^2\), or higher moments such as \(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^k\) for \(k = 4, 6, \dots\). Fortunately, the joint independence of the \(X_n\) allows us to compute such moments fairly easily, with some elementary combinatorics. A direct application of the moment method typically requires an additional finite moment hypothesis on \(X\) such as \(\mathbf{E}|X|^k < \infty\), but as we shall see, one can reduce fairly easily to this case by a so-called truncation argument. For the strong law of large numbers, one can also use methods from martingale theory, such as stopping time arguments and maximal inequalities; we will present some classical arguments of Kolmogorov in this context.
1. The moment method
We begin by using the moment method to establish both the strong and weak law of large numbers for sums of iid random variables, under additional moment hypotheses. To prove the versions for complex variables, it suffices to do so for real ones, since one can take real and imaginary parts after establishing the real case. Thus we shall focus our attention on real random variables henceforth.
Recall that:
- By the change of variables formula, two real random variables \(X\) and \(Y\) with the same distribution \(\mu\) have the same expectation whenever it is defined: thus, if \(\mathbf{P}(X \in S) = \mathbf{P}(Y \in S)\) for all measurable \(S\), then \(\mathbf{E}X = \mathbf{E}Y\) and \(\mathbf{E}|X| = \mathbf{E}|Y|\).
- In particular, if \(X_1, X_2, \dots\) is an iid sequence of copies of an absolutely integrable real random variable \(X\) with mean \(\mathbf{E}X = \mu\), then the \(X_n\) all have the same mean \(\mathbf{E}X_n = \mu\).
- Adding or subtracting a constant from a random variable does not affect its variance, since the constant is cancelled by the corresponding shift in the expectation: \(\mathbf{Var}(X - \mu) = \mathbf{E}|(X - \mu) - \mathbf{E}(X - \mu)|^2 = \mathbf{E}|X - \mathbf{E}X|^2 = \mathbf{Var}(X)\).
The first moment
Let \(X_1, X_2, \dots\) be a sequence of iid copies of a scalar random variable \(X\), and write \(S_n := X_1 + \cdots + X_n\) for the partial sums. Suppose that \(X\) is absolutely integrable with mean \(\mathbf{E}X = \mu\). Then one can use linearity of expectation to compute the expectation, i.e. the first moment, of the partial sum \(S_n\): \[\mathbf{E}S_n = \mathbf{E}X_1 + \cdots + \mathbf{E}X_n = n\mu.\]
In particular, the mean \(\mu\) is the expectation of the empirical mean \((X_1 + \cdots + X_n)/n\). This is consistent with the strong and weak laws of large numbers, but does not by itself imply them. For now, in the case that \(X\) is unsigned with finite mean, we can record the following rather weak bound coming from Markov's inequality: \[\mathbf{P}\left(\frac{X_1+\cdots+X_n}{n} \geq \lambda \mu\right) \leq \frac{1}{\lambda}\] for any \(\lambda > 0\). Thus, in the unsigned case, the empirical mean \(\frac{X_1 + \cdots + X_n}{n}\) usually does not get much larger than the mean \(\mu\). We will refer to this inequality as a first moment bound on \(S_n\), as it is obtained purely through computation of the first moment \(\mathbf{E}S_n\).
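As a sanity check, here is a small Monte Carlo sketch of this first moment bound, assuming NumPy; the choice \(X \sim \mathrm{Exp}(1)\) (so \(\mu = 1\)) is hypothetical and made only for illustration.

```python
import numpy as np

# First moment (Markov) bound: for unsigned X with mean mu,
# P(S_n / n >= lambda * mu) <= 1 / lambda.  Here X ~ Exp(1), so mu = 1.
rng = np.random.default_rng(2)
n, trials, lam, mu = 20, 200_000, 3.0, 1.0

S_n = rng.exponential(scale=1.0, size=(trials, n)).sum(axis=1)
empirical = np.mean(S_n / n >= lam * mu)
print(f"estimated P(S_n/n >= {lam}*mu) ≈ {empirical:.4f}; "
      f"Markov bound 1/lambda = {1 / lam:.4f}")
```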
Second moments
We now turn to second moment bounds, obtained through computations of second moments such as \(\mathbf{E}|(X_1 + \cdots + X_n)/n|^2\) or \(\mathbf{Var}(S_n) = \mathbf{E}|S_n - \mathbf{E}S_n|^2\). Here it will be convenient to normalize the mean \(\mu\) to equal zero, by replacing each \(X_i\) with \(X_i - \mu\) and \(X\) by \(X - \mu\), so that the partial sum \(S_n\) is replaced by \(S_n - n\mu\) and the empirical mean \(S_n/n\) by \(S_n/n - \mu\). With this normalization, we see that to prove the strong or weak law of large numbers, it suffices to do so in the mean zero case \(\mu = 0\). On the other hand, if \(X\) is unsigned, then normalizing it in this fashion will typically destroy the unsigned property, so this normalization is not always desirable.
Remarks
- One can compute the variance using linearity of expectation term by term: \(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|(X_1 - \mathbf{E}X_1) + \cdots + (X_n - \mathbf{E}X_n)|^2 = \mathbf{E}|(X_1 - \mu) + \cdots + (X_n - \mu)|^2\).
- The variance of the partial sum \(S_n\) can be written in several equivalent forms: \(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|(X_1 + \cdots + X_n) - \mathbf{E}(X_1 + \cdots + X_n)|^2 = \mathbf{E}|(X_1 + \cdots + X_n) - n\mu|^2\).
- With the mean zero normalization, the variance is just the second moment: \(\mathbf{Var}(X) = \mathbf{E}|X|^2\); in particular, \(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|X_1 + \cdots + X_n|^2\). Hence, in the mean zero case, the variance is finite if and only if \(X\) is square-integrable.
- The first moment calculation shows that if the mean \(\mu = \mathbf{E}X\) is normalized to zero, then \(\mathbf{E}S_n = n\mu = 0\) as well.
Recall that:
- Two random variables \(X, Y\) are independent if and only if \[\mathbf{E}F(X)G(Y) = (\mathbf{E}F(X))(\mathbf{E}G(Y))\] for all scalar functions \(F, G\) on the respective ranges \(R, R'\) of \(X, Y\), provided that \(F(X), G(Y)\) are both unsigned or both absolutely integrable. In particular, \(\mathbf{E}XY = (\mathbf{E}X)(\mathbf{E}Y)\) when \(X, Y\) are independent (see the numerical check after this list).
- The Chebyshev inequality \(\mathbf{P}(|X - \mathbf{E}X| \geq t) \leq \frac{1}{t^2}\mathbf{Var}(X)\), where \(\mathbf{Var}(X) = \mathbf{E}|X - \mathbf{E}X|^2\) is the variance (the second moment of \(X - \mathbf{E}X\)).
- The Cauchy-Schwarz inequality \(|\mathbf{E}XY| \leq (\mathbf{E}|X|^2)^{1/2} (\mathbf{E}|Y|^2)^{1/2}\), valid whenever \(X, Y\) are square-integrable, i.e. \(\mathbf{E}|X|^2, \mathbf{E}|Y|^2 < \infty\). In particular, \(XY\) is then absolutely integrable, and \(\mathbf{E}XY\) is controlled by the second moments of \(X\) and \(Y\).
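Here is a quick numerical check (not a proof) of the product rule \(\mathbf{E}XY = (\mathbf{E}X)(\mathbf{E}Y)\) for independent variables, assuming NumPy; the uniform distributions below are hypothetical choices made only for illustration.

```python
import numpy as np

# Product rule for independent variables: E[XY] should match (EX)(EY).
rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 2.0, size=1_000_000)   # independent of Y by construction
Y = rng.uniform(0.0, 3.0, size=1_000_000)

print("E[XY]    ≈", np.mean(X * Y))
print("(EX)(EY) ≈", np.mean(X) * np.mean(Y))
```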
We now start computing the variance of the partial sums \(S_n\). Suppose that \(X\) has finite second moment, which we write as \(\mathbf{E}|X|^2 = \sigma^2\). Since adding or subtracting a constant does not affect the variance, \(\mathbf{Var}(X - \mu) = \mathbf{Var}(X)\), we may also assume that \(X\) has been normalized to have mean zero. Then the variance of \(S_n\) is simply \(\mathbf{E}|X_1 + \cdots + X_n|^2\). By linearity of expectation, we have \[\mathbf{Var}(X_1 + \cdots + X_n) = \sum_{1 \leq i,j \leq n}\mathbf{E}X_i X_j.\]
All the products \(X_i X_j\) appearing on the right-hand side are absolutely integrable thanks to the Cauchy-Schwarz inequality. If \(i = j\), then \(\mathbf{E}X_i X_j\) is equal to the second moment \(\mathbf{E}|X|^2 = \sigma^2\). If instead \(i \neq j\), then \(X_i, X_j\) are independent and have mean zero by hypothesis, and thus \[\mathbf{E}X_i X_j = (\mathbf{E}X_i)(\mathbf{E}X_j) = 0.\]
Putting these in, we obtain
\[\mathbf{Var}(X_1 + \cdots + X_n) = n \sigma^2\]
or equivalently, for the empirical mean, \[\mathbf{Var}((X_1 + \cdots + X_n)/n) = \frac{1}{n}\sigma^2.\]
This bound was established in the mean zero case, but it clearly holds in general, since adding or subtracting a constant does not affect the variance; for instance, \(\mathbf{Var}(S_n/n) = \mathbf{Var}(S_n/n - \mu) = \frac{1}{n}\sigma^2\), where now \(\sigma^2 = \mathbf{E}|X - \mu|^2\).
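A brief numerical check of this \(\sigma^2/n\) law, assuming NumPy; the uniform distribution on \([0,1]\) (for which \(\sigma^2 = 1/12\)) is a hypothetical choice for illustration.

```python
import numpy as np

# Variance of the empirical mean: Var(S_n / n) should be sigma^2 / n.
# X ~ Uniform(0, 1), so sigma^2 = 1/12.
rng = np.random.default_rng(4)
n, trials = 50, 200_000

empirical_means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
print("empirical Var(S_n/n) ≈", empirical_means.var())
print("predicted sigma^2/n  =", (1.0 / 12.0) / n)
```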
This is the first demonstration of the concentration of measure effect that arises from averaging many independent random variables \(X_1, \dots, X_n\). At the opposite extreme, we could take \(X_1, \dots, X_n\) all to be exactly the same random variable: \(X_1 = \cdots = X_n = X\). Then \(\frac{X_1 + \cdots + X_n}{n} = X\) has exactly the same mean and variance as \(X\), and no concentration occurs.
Remarks:
- Note that with the mean zero normalization \(\sigma^2 = \mathbf{E}|X|^2\), the quantity \(\sigma = (\mathbf{E}|X|^2)^{1/2}\) is exactly the expression that appears in the Cauchy-Schwarz inequality \(|\mathbf{E}XY| \leq (\mathbf{E}|X|^2)^{1/2} (\mathbf{E}|Y|^2)^{1/2}\), which in this notation reads \(|\mathbf{E}XY|^2 \leq \sigma_X^2 \sigma_Y^2\).
Now let us examine the behavior of the empirical mean \(S_n/n\). Inserting the variance bound we have just obtained into Chebyshev's inequality, we obtain \[\mathbf{P}\left(\left|\frac{S_n}{n} - \mu\right| \geq \varepsilon\right) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\]
for any natural number \(n\) and any \(\varepsilon > 0\), whenever \(X\) has mean \(\mu\) and finite variance \(\sigma^2 < \infty\). Hence, \(\frac{X_1 + \cdots + X_n}{n} \rightarrow \mu\) in probability: we have proved the weak law of large numbers in the case that \(X\) has finite variance.
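As a sketch of this Chebyshev bound in action (purely illustrative, assuming NumPy; the uniform distribution is a hypothetical choice), one can compare the empirical tail probability with \(\frac{\sigma^2}{n\varepsilon^2}\):

```python
import numpy as np

# Weak law via Chebyshev: P(|S_n/n - mu| >= eps) <= sigma^2 / (n * eps^2).
# X ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12.
rng = np.random.default_rng(5)
trials, eps, mu, var = 10_000, 0.05, 0.5, 1.0 / 12.0

for n in (10, 100, 1_000):
    means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
    empirical = np.mean(np.abs(means - mu) >= eps)
    bound = var / (n * eps**2)
    print(f"n = {n:>5}: empirical {empirical:.4f} <= Chebyshev bound {bound:.4f}")
```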
Note that the last bound on \(S_n/n\) can be sharpened: the quantity \[\frac{\sqrt{n}}{\omega(n)}\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right|\]
converges to zero in probability whenever \(\omega(n)\) is a function of \(n\) that goes to infinity as \(n \rightarrow \infty\). For instance, we have \[\frac{\sqrt n}{\log n}\left|\frac{S_n}{n} - \mu\right| \rightarrow 0\]
in probability. An informal reading of this result is that the empirical mean \(S_n/n\) tends to stray from \(\mu\) by not much more than about \(1/\sqrt n\). This understanding will be reinforced later when we study the central limit theorem and related results such as the Chernoff inequality. It is refined further by the law of the iterated logarithm, which we will not be able to get to in this set of notes.
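To see this scaling numerically (a rough illustration, assuming NumPy; the uniform distribution is again a hypothetical choice), one can look at the typical size of \(\sqrt{n}\,|S_n/n - \mu|\), which stays of order \(\sigma\), versus \(\frac{\sqrt{n}}{\log n}|S_n/n - \mu|\), which shrinks:

```python
import numpy as np

# Fluctuation scale of the empirical mean for X ~ Uniform(0, 1) (mu = 1/2):
# sqrt(n) * |S_n/n - mu| stays of order sigma, while the further division
# by log(n) drives the quantity toward zero as n grows.
rng = np.random.default_rng(6)
trials, mu = 1_000, 0.5

for n in (100, 10_000):
    devs = np.abs(rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1) - mu)
    print(f"n = {n:>6}:"
          f"  median sqrt(n)*dev ≈ {np.median(np.sqrt(n) * devs):.3f},"
          f"  median sqrt(n)/log(n)*dev ≈ {np.median(np.sqrt(n) / np.log(n) * devs):.3f}")
```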
Recall that:
- Jensen's inequality \[f(\mathbf{E}X) \leq \mathbf{E}f(X)\] for a real random variable \(X\) and a convex function \(f: \mathbf{R} \rightarrow \mathbf{R}\), where \(X\) and \(f(X)\) both have finite expectation (a quick numerical check follows).
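Here is a minimal numerical sanity check of Jensen's inequality (illustration only, assuming NumPy; the convex function \(f(x) = x^2\) and the distribution of \(X\) are hypothetical choices):

```python
import numpy as np

# Jensen's inequality f(EX) <= E f(X) for the convex function f(x) = x^2,
# checked on samples of a hypothetical X ~ Normal(1, 1).
rng = np.random.default_rng(7)
X = rng.normal(loc=1.0, scale=1.0, size=1_000_000)

print("f(EX)  ≈", np.mean(X) ** 2)        # about 1
print("E f(X) ≈", np.mean(X ** 2))        # about 2
```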
One can also hope to use the bound \(\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) together with the Borel-Cantelli lemma to obtain the strong law of large numbers in the second moment case,