\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage{natbib}
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[greek,ngerman,english]{babel}
\usepackage{float}
% Edit this header.tex file to include frontmatter definitions and global macros
% Add here any LaTeX packages you would like to load in all document blocks
% \usepackage{xspace}
\usepackage{amsfonts}
% Add here any LaTeX macros you would like to load in all document blocks
% \def\example{This is an example macro.}
\newcommand{\msun}{\,\mathrm{M}_\odot}
%\def\realnumbers{\mathbf{R}}
\newcommand{\realnumbers}{\mathbf{R}}
% -----
\iflatexml
% Add here any LaTeXML-specific commands
% -----
\else
% Add here any export style-specific LaTeX commands. These will only be loaded upon document export.
% \paperfield{Subject domain of my document}
% \keywords{keyword1, keyword2}
% \corraddress{Author One PhD, Department, Institution, City, State or Province, Postal Code, Country}
% \fundinginfo{Funder One, Funder One Department, Grant/Award Number: 123456.}
\fi
\begin{document}
\title{The weak and strong law of large numbers}
\author[1]{Shion T. Fujie}%
\affil[1]{Affiliation not available}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy
One of the primary activities in probability theory is studying the
various statistics that can be formed from a complex system with many
components. For instance, one of the simplest systems one can consider
is a finite or infinite sequence~\(X_1, \dots, X_n\)
or~\(X_1, X_2, \dots\) of jointly independent scalar random variables,
with the case in which the~\(X_i\) are also identically distributed
(the~\(X_i\) are iid for short) being a model case of
interest. There is an abundance of statistics that one can study for
such sequences, but one of the most basic is the sequence of
partial sums
\[ S_n := X_1 + \cdots + X_n. \]
The first fundamental result about these sums is the law of large
numbers (or LLN for short), which comes in two formulations, weak and
strong (WLLN and SLLN respectively). To present these laws, we
begin with the notion of convergence in probability.
\textbf{Definition 1~}Let~\(X_n\) be a sequence of scalar
random variables taking values in~\(\mathbf{R}\)
or~\(\mathbf{C}\), and let~\(X\) be another scalar
random variable. We say that~\(X_n\) converges in
probability to~\(X\) if, for every
radius~\(\varepsilon > 0\), one has~\(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\)
as~\(n \rightarrow \infty\).~
Informally,~\(X_n\) converges to~\(X\) in
probability if, for each fixed~\(\varepsilon\), the event~\(|X_n - X| > \varepsilon\)
becomes increasingly unlikely as~\(n\) tends to~\(\infty\).
Recall that a sequence~\(X_1,X_2,\dots\) converges almost surely to~\(X\) if
one has~\(\mathbf{P}(\lim_{n \rightarrow \infty} X_n = X) = 1\), or equivalently if the probabilities
of~\(\bigvee_{n \geq N} (|X_n-X|>\varepsilon)\) tend to zero as~\(N\)
tends to~\(\infty\), for every~\(\varepsilon > 0\). Roughly speaking, convergence in
probability is useful for controlling how each of the individual random
variables~\(X_n\) is close to the putative
limit~\(X\), while almost sure convergence is useful for
controlling how the entire tail~\((X_n)_{n \geq N}\) of the sequence is
close to its putative limiting value~\(X\).
We have the following relationships between convergence in probability
and almost sure convergence:
\textbf{Exercise 2~}Let~\(X_n, X\) be scalar random variables.
\begin{enumerate}
\tightlist
\item
  Almost sure convergence implies convergence in probability: if~\(X_n \rightarrow X\)
  almost surely, then~\(X_n \rightarrow X\) in probability. Thus control of the
  tail~\((X_n)_{n \geq N}\) is somewhat stronger than control of the
  individual~\(X_n\). Give a counterexample showing that
  the converse does not hold.
\item
  (Borel--Cantelli lemma) Suppose that~\(\Sigma_n \mathbf{P}(|X_n - X| > \varepsilon) < \infty\) for
  all~\(\varepsilon > 0\). Show that~\(X_n \rightarrow X\)
  almost surely. Hence a sequence of random variables for which the
  probabilities~\(\mathbf{P}(|X_n - X| > \varepsilon)\) are summable for every threshold
  \(\varepsilon > 0\) is almost surely convergent. This technique is
  convenient for deducing almost sure convergence from summable bounds on
  convergence in probability. Give a counterexample showing that the converse fails.
\item
  This gives a (strictly) partial converse to 1: if~\(X_n \rightarrow X\)
  in probability, then one can find a subsequence~\(X_{n_j}\)
  such that~\(X_{n_j} \rightarrow X\) almost surely.
\item
  If~\(\mathbf{E}|X_n|, \mathbf{E}|X| < \infty\) and~\(\mathbf{E}|X_n - X| \rightarrow 0\),
  then~\(\mathbf{P}(|X_n - X| > \varepsilon) \rightarrow 0\) as~\(n \rightarrow \infty\) for
  every~\(\varepsilon > 0\); thus, if absolutely integrable
  variables~\(X_n\) converge in expectation to another absolutely
  integrable variable~\(X\), then~\(X_n\) converges
  to~\(X\) in probability. Give a counterexample showing that
  convergence in expectation is strictly stronger.
\item
  (Urysohn subsequence principle) Suppose that every
  subsequence~\(X_{n_j}\) of~\(X_n\) has a further
  subsequence~\(X_{n_{j_k}}\) that converges to~\(X\)
  in probability. Show that this is a necessary and sufficient condition
  for the entire sequence~\(X_n\) to converge to~\(X\)
  in probability.
\item
  Does the Urysohn subsequence principle still hold if one replaces ``in
  probability'' throughout with ``almost surely''?
\item
  A continuous function preserves convergence in probability:
  if~\(F: \mathbf{R} \rightarrow \mathbf{R}\) or~\(\mathbf{C} \rightarrow \mathbf{C}\) is a continuous function
  and~\(X_n \rightarrow X\) in probability, then~\(F(X_n) \rightarrow F(X)\) in
  probability also. This holds more generally for functions of several
  variables~\(F: \mathbf{R}^k \rightarrow \mathbf{R}\) or~\(\mathbf{C}^k \rightarrow \mathbf{C}\); for instance,
  if~\(X_n, Y_n\) converge respectively to~\(X, Y\) in
  probability, then~\(X_n + Y_n \rightarrow X + Y\) and~\(X_n Y_n \rightarrow XY\)
  in probability also.
\item
  (Fatou's lemma for convergence in probability) If~\(X_n\)
  are unsigned and~\(X_n \rightarrow X\) in probability,
  then~\(\mathbf{E}X \leq \liminf_{n \rightarrow \infty} \mathbf{E}X_n\); that is, the limit
  inferior~\(\liminf_{n \rightarrow \infty} \mathbf{E}X_n\) gives an upper bound for the
  expectation~\(\mathbf{E}X\).
\item
  (Dominated convergence in probability) If~\(X_n \rightarrow X\) in
  probability, and almost surely~\(|X_n| \leq Y\) for
  all~\(n\) and some random variable~\(Y\) with finite expectation,
  then~\(\mathbf{E}X_n \rightarrow \mathbf{E}X\) as~\(n \rightarrow \infty\).
\end{enumerate}
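For a concrete illustration of item 1 (the specific choice of Bernoulli variables below is ours, not from the text), one can simulate independent~\(X_n\) with~\(\mathbf{P}(X_n = 1) = 1/n\) and~\(X_n = 0\) otherwise: then~\(X_n \rightarrow 0\) in probability, but since~\(\Sigma_n 1/n\) diverges, the second Borel--Cantelli lemma shows that~\(X_n = 1\) infinitely often almost surely. A minimal sketch in Python:

```python
import random

random.seed(0)

def sample_path(n_max, rng=random):
    # One realization of independent X_n ~ Bernoulli(1/n), n = 1..n_max.
    return [1 if rng.random() < 1.0 / n else 0 for n in range(1, n_max + 1)]

trials = 2000
n_max = 1000

# Convergence in probability to 0: P(X_1000 = 1) = 1/1000 is tiny.
hits_at_end = sum(sample_path(n_max)[-1] for _ in range(trials))

# Failure of almost sure convergence: the chance that the tail
# n in (500, 1000] still contains a 1 is exactly
# 1 - prod_{n=501}^{1000} (1 - 1/n) = 1 - 500/1000 = 1/2.
tails_with_a_one = sum(1 for _ in range(trials)
                       if any(sample_path(n_max)[500:]))

print(hits_at_end / trials)       # near 1/1000
print(tails_with_a_one / trials)  # near 1/2
```

In every window~\((N, 2N]\) the sequence takes the value~\(1\) with probability exactly~\(1/2\), no matter how large~\(N\) is, which is why almost sure convergence fails.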
\textbf{Exercise 3}~
\begin{itemize}
\tightlist
\item
  (Independence of variables and limiting values in probability) Suppose
  we have a sequence~\(X_1, X_2, \dots\) of scalar random variables,
  another variable~\(X\), and~\(X_n \rightarrow X\) in
  probability. Show that a random variable~\(Y\) that is
  independent of the~\(X_n\) is also independent of
  the limiting value~\(X\).
\item
  (Jointly independent variables have almost surely constant limits) If
  the~\(X_n\) are jointly independent and~\(X_n \rightarrow X\) in probability,
  then the limit~\(X\) is almost surely constant.
\end{itemize}
We are now able to state the weak and strong laws of large numbers, in
the model case of independent and identically distributed (iid for short)
random variables.
Recall that:
\begin{itemize}
\tightlist
\item
  An iid sequence~\(X_1, X_2, \dots\) of random
  variables is one in which the~\(X_n\) are jointly
  independent and identically distributed, each being a copy of a single
  random variable~\(X\): thus~\(\mathbf{P}(X_n \in S) = \mathbf{P}(X_m \in S)\) for
  all~\(n, m\) and all measurable~\(S\).
\item
  A sequence~\(X_1, X_2, \dots\) of random variables is jointly
  independent if its joint distribution is the product of
  the individual distributions:
  equivalently,~\(\mathbf{P}((X_{i_1} \in S_1) \wedge \cdots \wedge (X_{i_n} \in S_n)) = \mathbf{P}(X_{i_1} \in S_1) \cdot \cdots \cdot \mathbf{P}(X_{i_n} \in S_n)\) for all finite
  collections of distinct indices~\(i_1, \dots ,i_n\) and all
  measurable sets~\(S_1, \dots, S_n\).
\end{itemize}
\textbf{Theorem 4 (Law of large numbers, model
case)}~Let~\(X_1, X_2, \dots\) be an iid sequence of copies
of a random variable~\(X\) with finite expectation.
\begin{enumerate}
\tightlist
\item
  (Weak law of large numbers) Then~\((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) in probability.
\item
(Strong law of large numbers) Then,~\((X_1 + \cdots + X_n)/n \rightarrow \mathbf{E}X\) almost
surely.
\end{enumerate}
Remarks:
\begin{itemize}
\tightlist
\item
  The law of large numbers asserts, roughly, that the theoretical
  expectation~\(\mathbf{E}X\) of a random
  variable~\(X\) can be approximated by taking a large
  number of independent samples~\(X_1, \dots, X_n\)
  of~\(X\) and then forming the empirical
  mean~\[\frac{X_1+\cdots+X_n}{n}.\] This ability to approximate the
  theoretical statistics of a probability distribution through empirical
  data is one of the foundational starting points for statistics (which
  is not the focus of this course).
\item
The tendency of statistics to cluster closely around their mean value
is the simplest instance of the concentration of measure phenomenon,
which is of great significance not only within probability, but also
in applications of probability such as statistics, theoretical
computer science, combinatorics, random matrix theory, and high
dimensional geometry. (See
\href{https://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/}{this
blog post} for further discussion.)
\item
  It is instructive to compare the LLN with what one can obtain from
  the~\href{https://en.wikipedia.org/wiki/Kolmogorov\%27s_zero\%E2\%80\%93one_law}{Kolmogorov
  zero-one law}: if the~\(X_n\) are real-valued, then the
  limit superior~\(\limsup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n\) and limit
  inferior~\(\liminf_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n\) are tail random variables, in the sense
  that they remain unaffected if one modifies finitely many of
  the~\(X_n\); in particular, events such
  as~\(\limsup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n > t\) are tail events for any~\(t \in \mathbf{R}\).
  From this fact and the zero-one law we can guarantee that there exist
  deterministic constants~\(-\infty \le \mu_- \le \mu_+ \leq \infty\) such
  that~\(\liminf_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n = \mu_-\) and~\(\limsup_{n\rightarrow\infty} (X_1 + \cdots + X_n)/n = \mu_+\) almost surely. One
  can then view the strong law of large numbers as the
  assertion that~\(\mu_- = \mu_+ = \mathbf{E}X\) when~\(X\) is absolutely
  integrable.
\item
  Informally, if~\(X_1, X_2, \dots \) are iid with
  mean~\(\mu\), then~\(X_1+\cdots+X_n\approx \mu n\) for
  large~\(n\).
\end{itemize}
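The approximation described in the first remark above is easy to observe numerically; a minimal sketch in Python (the choice of exponential samples with mean~\(1\) is ours, for illustration):

```python
import random

random.seed(1)

def empirical_mean(n, rng=random):
    # Empirical mean (X_1 + ... + X_n)/n of n iid Exp(1) samples (E X = 1).
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

for n in (10, 1000, 100000):
    print(n, empirical_mean(n))  # approaches E X = 1 as n grows
```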
There are several variants of the law of large numbers, for instance
when one drops the hypothesis of identical distribution, when the
random variable~\(X\) is not absolutely integrable, or when
one seeks more quantitative bounds on the rate of convergence; we
will discuss some of these variants later in these notes.
There are several ways to prove the law of large numbers. One main
strategy is the moment method, in which one controls statistics such
as the empirical mean~\((X_1 + \cdots + X_n)/n\) by computing moments such as
the mean~\(\mathbf{E}(X_1 + \cdots + X_n)/n\), the variance~\(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^2\), or
higher moments such as~\(\mathbf{E}|(X_1 + \cdots + X_n)/n - \mathbf{E}(X_1 + \cdots + X_n)/n|^k\) for~\(k=4,6, \dots\).
Fortunately, the joint independence of the~\(X_n\) allows us
to compute such moments fairly easily, with some elementary
combinatorics. A direct application of the moment method typically
requires one to impose a finite moment assumption on the random
variable~\(X\) such as~\(\mathbf{E}|X|^k < \infty\), but as we shall
see, one can reduce fairly easily to this case by a so-called truncation
argument. For the strong law of large numbers, one can also use methods
from martingale theory, such as stopping time
arguments and maximal inequalities; we will present some classical
arguments of Kolmogorov in this context.
\section*{1. The moment method}
{\label{377500}}
We begin by using the moment method to establish both the strong and
weak law of large numbers for sums of iid random variables, under
additional moment hypotheses. To prove the versions for complex
variables, it suffices to do so for real ones, since one can take real
and imaginary parts after establishing the real case. Thus we shall
focus our attention on real random variables henceforth.
Recall that:
\begin{itemize}
\tightlist
\item
  By the change of variables formula, two real random
  variables~\(X\) and~\(Y\) with the same
  distribution~\(\mu\) have the same absolute
  expectation~\(\mathbf{E}|X| = \mathbf{E}|Y|\): thus, if~\(\mathbf{P}(X \in S) = \mathbf{P}(Y \in S)\) for
  all measurable \(S\), then~\(\mathbf{E}|X| = \mathbf{E}|Y|\).
\item
  In particular, if~\(X_1, X_2, \dots\) is an iid sequence of copies of
  an absolutely integrable real random variable~\(X\)
  with mean~\(\mathbf{E}X = \mu\), then the~\(X_n\) all have
  the same mean~\(\mathbf{E}X_n = \mu\).
\item
  Adding or subtracting a constant from a random variable does not
  affect its variance, since the shift is cancelled by the matching
  shift in the mean:~\(\mathbf{Var}(X - \mu) = \mathbf{E}|X - \mathbf{E}X|^2 = \mathbf{Var}(X)\).
\end{itemize}
\subsection*{The first moment}
{\label{891516}}
Let~\(X_1, X_2, \dots\) be a sequence of iid copies of a scalar random
variable~\(X\), and put our partial
sums~\(S_n := X_1 + \cdots + X_n\). Suppose that~\(X\) is
absolutely integrable with mean~\(\mathbf{E}X = \mu\). Then one can use
linearity of expectation to compute the expectation, i.e.\ the first
moment, of the partial sum~\(S_n\):~\[\mathbf{E}S_n = \mathbf{E}X_1 + \cdots + \mathbf{E}X_n = n\mu.\]
In particular, the mean~\(\mu\) is the
expectation of the empirical mean~\((X_1 + \cdots + X_n)/n\). This is
consistent with the strong and weak laws of large numbers, though it does
not by itself imply these laws. In the case that~\(X\) is
unsigned with finite mean~\(\mathbf{E}X\), Markov's inequality nevertheless
gives the weak bound~\[\mathbf{P}\left(\frac{X_1+\cdots+X_n}{n}\ge\lambda\mu\right)\le\frac{1}{\lambda}\]for any
threshold~\(\lambda > 0\). Therefore, in the unsigned
case, the empirical mean~\(\frac{X_1 + \cdots + X_n}{n}\) usually does not get much larger
than the mean~\(\mu\). We will refer to this inequality as
a first moment bound on the~\(S_n\) (as it is obtained
primarily through computation of the first moment
\(\mathbf{E}S_n\)).~
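This first moment bound can be checked by simulation; a minimal sketch in Python, testing the Markov-type bound~\(\mathbf{P}(\frac{X_1+\cdots+X_n}{n} \ge \lambda\mu) \le \frac{1}{\lambda}\) for unsigned~\(X\) (the distribution~\(\mathrm{Exp}(1)\), the sample size, and the threshold are our choices, not from the text):

```python
import random

random.seed(5)

def empirical_mean(n, rng=random):
    # Empirical mean of n iid Exp(1) samples: an unsigned X with mu = 1.
    return sum(rng.expovariate(1.0) for _ in range(n)) / n

# Markov-type bound: P((X_1 + ... + X_n)/n >= lambda * mu) <= 1/lambda.
n, lam, mu, trials = 20, 2.0, 1.0, 50000
freq = sum(empirical_mean(n) >= lam * mu for _ in range(trials)) / trials
print(freq, "<=", 1 / lam)  # observed frequency obeys the bound
```

The observed frequency is in fact far below~\(1/\lambda\); the first moment bound is weak, which is why sharper moment bounds are developed next.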
\subsection*{Second moments}
{\label{679267}}
We now turn to second moment bounds, obtained through computations of
second moments such as~\(\mathbf{E}|(X_1 + \cdots + X_n)/n|^2\) or~\(\mathbf{Var}(S_n) = \mathbf{E}|S_n - \mathbf{E}S_n|^2\).
Here it will be convenient to normalize the
mean~\(\mu\) to equal zero, by replacing
each~\(X_i\) with~\(X_i - \mu\)
and~\(X\) by~\(X - \mu\), so that the partial
sum~\(S_n\) is replaced by~\(S_n - n\mu\) and the
empirical mean~\(S_n/n\) by~\(S_n/n - \mu\). With this
normalization, we see that to prove the strong or weak law of large
numbers, it suffices to do so in the mean zero case~\(\mu = 0\). On
the other hand, if~\(X\) is unsigned, then
normalizing it in this fashion will generally destroy the
unsigned property of~\(X\), so this normalization is not always
desirable.
Remarks:
\begin{itemize}
\tightlist
\item
  One can compute the variance using
  linearity of expectation:~\(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|(X_1 - \mathbf{E}X_1) + \cdots + (X_n - \mathbf{E}X_n)|^2 = \mathbf{E}|(X_1 - \mu) + \cdots + (X_n - \mu)|^2\).
\item
  The variance of the partial sum~\(S_n\) can be written in
  several equivalent forms:~\(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|(X_1 + \cdots + X_n) - \mathbf{E}(X_1 + \cdots + X_n)|^2 = \mathbf{E}|(X_1 + \cdots + X_n) - n\mu|^2\).
\item
With the mean-zero normalization, the variance is the second
moment:~\(\mathbf{Var}(X) = \mathbf{E}|X|^2\); in particular,~\(\mathbf{Var}(X_1 + \cdots + X_n) = \mathbf{E}|X_1 + \cdots + X_n|^2\).
Hence, with mean equal zero, the variance is finite if and only
if~\(X\) is square-integrable.
\item
  The calculation of the first moment shows that if~\(X\) is
  normalized to have mean zero, then~\(S_n\) also has mean zero:~\(\mathbf{E}S_n = 0\).
\end{itemize}
Recall that~
\begin{itemize}
\tightlist
\item
  Two random variables~\(X, Y\) are independent if and only
  if~\[\mathbf{E}F(X)G(Y) = (\mathbf{E}F(X))(\mathbf{E}G(Y))\] for all scalar functions~\(F, G\),
  provided~\(F(X), G(Y)\) are both
  unsigned or both absolutely integrable. In particular, we
  have~\(\mathbf{E}XY = (\mathbf{E}X)(\mathbf{E}Y)\) when~\(X, Y\) are independent.
\item
The Chebyshev inequality~\(\mathbf{P}(|X - \mathbf{E}X| \geq t) \leq \frac{1}{t^2}\mathbf{Var}(X)\) for the second
moment~\(\mathbf{Var}(X) = \mathbf{E}|X - \mathbf{E}X|^2\).
\item
  The Cauchy--Schwarz inequality~\(|\mathbf{E}XY| \leq (\mathbf{E}|X|^2)^{1/2} (\mathbf{E}|Y|^2)^{1/2}\) is valid
  whenever~\(X, Y\) are
  square-integrable:~\(\mathbf{E}|X|^2, \mathbf{E}|Y|^2 < \infty\). Hence the
  expectation of the product~\(XY\) is
  bounded in terms of the respective second moments of~\(X\)
  and~\(Y\).
\end{itemize}
We now start computing the variance of the partial
sum~\(S_n\). Suppose that~\(X\)~has finite
second moment, which we write as~\(\mathbf{E}|X|^2 = \sigma^2\). Since adding
or subtracting a constant does not affect the
variance,~\(\mathbf{Var}(X - \mu) = \mathbf{Var}(X)\), we may also assume~\(X\) has
been normalized to have mean zero. Then the variance of
\(S_n\) is simply~\(\mathbf{E}|X_1 + \cdots + X_n|^2\). By linearity of
expectation, we have~\[\mathbf{Var}(X_1 + \cdots + X_n) = \underset{1 \leq i,j \leq n}{\sum}\mathbf{E}X_i X_j.\]
All expressions here on the right side of the identity are absolutely
integrable thanks to the Cauchy-Schwarz inequality.
If~\(i = j\), then the expectation~\(\mathbf{E}X_i X_j\) is equal to the
second moment~\(\mathbf{E}|X|^2 = \sigma^2\). If instead~\(i \neq j\),
then~\(X_i, X_j\) are independent and have mean zero by hypothesis,
and thus~\[\mathbf{E}X_i X_j = (\mathbf{E}X_i)(\mathbf{E}X_j) = 0.\]
Putting this together, we obtain
\[\mathbf{Var}(X_1 + \cdots + X_n) = n \sigma^2\]
or equivalently, that of the empirical mean\[\mathbf{Var}((X_1 + \cdots + X_n)/n) = \frac{1}{n}\sigma^2.\]
This bound was established in the mean zero case, but it holds
in general, since adding or subtracting a constant does
not affect the variance; for instance \(\mathbf{Var}(S_n/n) = \mathbf{Var}(S_n/n - \mu) = \frac{1}{n}\sigma^2\), where we
now write \(\sigma^2 = \mathbf{E}|X - \mu|^2\).~
This is our first demonstration of the concentration of measure effect
that arises from combining many independent random
variables~\(X_1, \dots, X_n\). At the opposite extreme, we may
take~\(X_1, \dots, X_n\) all to be exactly the same random
variable:~\(X_1 = \cdots = X_n = X\). Then~\(\frac{X_1 + \cdots + X_n}{n} = X\) has exactly the
same mean and variance as~\(X\), and no concentration occurs.
Remarks:
\begin{itemize}
\tightlist
\item
  Note that the standard deviation~\(\sigma = (\mathbf{E}|X|^2)^{1/2}\) has the
  same form as the quantities appearing on the right-hand side of the
  Cauchy--Schwarz inequality~\(|\mathbf{E}XY| \leq (\mathbf{E}|X|^2)^{1/2} (\mathbf{E}|Y|^2)^{1/2}\).
\end{itemize}
Now, let us examine the behavior of the empirical mean~\(S_n/n\). If we
insert the variance bound we have obtained into Chebyshev's inequality, we
obtain~\[\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\]
for any natural number~\(n\) and
threshold~\(\varepsilon > 0\), whenever~\(X\) has
mean~\(\mu\) and finite variance~\(\sigma^2 < \infty\).
Hence,~\(\frac{X_1 + \cdots + X_n}{n} \rightarrow \mu\) in probability: we have proved the weak law
of large numbers in the case that~\(X\) has finite
variance.
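The identity~\(\mathbf{Var}(S_n/n) = \sigma^2/n\) that drives this proof can be verified empirically; a minimal sketch in Python (uniform samples on~\([0,1]\), for which~\(\sigma^2 = 1/12\), are our choice):

```python
import random, statistics

random.seed(2)

def mean_of_n(n, rng=random):
    # Empirical mean of n iid Uniform(0,1) samples (variance 1/12).
    return sum(rng.random() for _ in range(n)) / n

n, trials = 100, 20000
sample = [mean_of_n(n) for _ in range(trials)]
# Var(S_n/n) should be close to sigma^2 / n = (1/12)/100.
print(statistics.variance(sample), (1 / 12) / n)
```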
Note that the last bound on~\(S_n/n\) can be strengthened, so
that~\[\frac{\sqrt{n}}{\omega(n)}|\frac{X_1 + \cdots + X_n}{n} - \mu|\]
converges to zero in probability whenever~\(\omega(n)\) is a
function of~\(n\) that goes to infinity
as~\(n \rightarrow \infty\). For instance we have\[\frac{\sqrt n}{\log n}|\frac{S_n}{n} - \mu| \rightarrow 0\]
in probability. The informal intuition behind this result is that the
empirical mean~\(S_n/n\) tends to stray
from~\(\mu\) by not much more than about~\(1/\sqrt n\).
This understanding will be reinforced later when we study the central
limit theorem and related results such as the Chernoff inequality. It is
refined further by the law of the iterated logarithm, which we will not be
able to get to in these notes.
Remarks:

We have obtained in the proof above the probabilistic bound on the
empirical mean\[\mathbf{P}(|\frac{X_1 + \cdots + X_n}{n} - \mathbf{E}X| \geq \delta) \leq \frac{1}{n}\frac{\mathbf{Var}(X)}{\delta^2}\]for arbitrary~\(n\) and
threshold~\(\delta > 0\).
\begin{itemize}
\tightlist
\item
  Unfortunately, one cannot combine the bounds~\(\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) with
  the Borel--Cantelli lemma to conclude that~\(S_n/n\) is almost surely
  convergent, since the harmonic series~\(\Sigma_n \frac{1}{n}\)
  diverges.
\end{itemize}
Recall that
\begin{itemize}
\tightlist
\item
Jensen's inequality~\[f(\mathbf{E}X) \leq \mathbf{E}f(X)\] for a real random
variable~\(X\) and a convex
function~\(f: \mathbf{R} \rightarrow \mathbf{R}\) where~\(X, f(X)\) both have
finite expectation. In particular for the random
variable~\(|X|^r \) and the function~\(f(x) := |x|^{\frac{s}{r}}\)
  for~\(0 < r < s < \infty\), we have~\[\mathbf{E}|X|^{r} \leq (\mathbf{E}|X|^s)^{r/s}.\]Thus, if some
  moment~\(\mathbf{E}|X|^s\) is finite, then all lower
  moments~\(\mathbf{E}|X|^r\) are finite as well.
\item
  Hölder's inequality~\[|\mathbf{E}XY| \leq (\mathbf{E}|X|^p)^{1/p}(\mathbf{E}|Y|^q)^{1/q}\] for scalar random
  variables~\(X, Y\) and exponents~\(1 < p, q < \infty\)
  satisfying~\(\frac{1}{p} + \frac{1}{q} = 1\). In particular, if we have four
  copies~\(X_i, X_j, X_k, X_l\) of a random variable~\(X\)
  with~\(\mathbf{E}|X|^4 < \infty\), then the
  moments~\(\mathbf{E}|X_i X_j|^2\) are finite, and hence~\(|\mathbf{E}X_i X_j X_k X_l| < \infty\)
  by the Cauchy--Schwarz inequality.
\end{itemize}
\subsection*{Higher moments}
{\label{811586}}
One can also hope to use the bound~\(\mathbf{P}(|\frac{S_n}{n} - \mu| \geq \varepsilon) \leq \frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) together with
the Borel--Cantelli lemma to obtain the strong law of large numbers in
the second moment case, but unfortunately the bounding
quantities~\(\frac{1}{n}\frac{\sigma^2}{\varepsilon^2}\) are not summable
in~\(n\). To obtain summable bounds, we turn to
moments higher than the second. One could try to compute third
moments such as~\(\mathbf{E}S_n^3\), but this turns out not to convey
much information, because of its signed nature; the
expression~\(\mathbf{E}|S_n|^3\) would in principle convey more usable
information, but is difficult to compute in practice,
as~\(|S_n|^3\) is not a polynomial combination of
the~\(X_i\). We therefore move on to the fourth moment.
We normalize~\(X\) to the mean zero
case~\(\mu = 0\), write~\(\sigma^2 = \mathbf{E}|X|^2\) for the
variance of~\(X\), and now assume a finite fourth
moment~\(\mathbf{E}|X|^4 < \infty\), which, by the Hölder or Jensen
inequalities, implies that all lower moments such
as~\(\mathbf{E}|X|^2\) are finite. We then
expand~\(\mathbf{E}|S_n|^4\):\[\mathbf{E}|X_1 + \cdots + X_n|^4 = \sum_{1 \leq i,j,k,l \leq n} \mathbf{E}X_iX_jX_kX_l.\] Note that all
expectations here are absolutely integrable by the hypothesis of finite
fourth moment and Hölder's inequality. The expectation~\(\mathbf{E}X_i X_j X_k X_l\) looks
complicated, but fortunately it simplifies in most cases.
Suppose for instance that~\(i\) differs
from~\(j,k,l\); then~\(X_i\)
and~\(X_j X_k X_l\) are independent, and since~\(\mathbf{E}X_i = 0\) we
have~\[\mathbf{E}X_i X_j X_k X_l = (\mathbf{E}X_i)(\mathbf{E}X_j X_k X_l) = 0.\]
Similarly when any other single index differs from the remaining
three. This leaves only a few
quadruples~\(i, j, k, l\) for which~\(\mathbf{E}X_i X_j X_k X_l\) could
be non-zero: the three cases~\(i = j \neq k = l\), \(i = k \neq j = l\), \(i = l \neq j = k\), where each of the
indices~\(i, j, k, l\) is equal to exactly one other
index; and the diagonal case~\(i = j = k = l\). If for
instance~\(i = j \neq k = l\), then~\[\mathbf{E}X_i X_j X_k X_l = \sigma^4\]
since~\(X_i X_j = X_i^2\) and~\(X_k X_l = X_k^2\) are independent.
Similarly for the cases~\(i = k \neq j = l\) and~\(i = l \neq j = k\); these
three cases contribute~\(3n(n-1)\sigma^4\) in total to~\(\mathbf{E}|S_n|^4\).
Hence we conclude that~\[\mathbf{E}|X_1 + \cdots + X_n|^4 = 3n(n - 1)\sigma^4 + n \mathbf{E}|X|^4,\] and by Markov's inequality
and removing the normalization~\(\mu = 0\),
that~\[\mathbf{P}(|\frac{S_n}{n} - \mu| > \varepsilon) \leq \frac{3n(n-1)\sigma^4 + n \mathbf{E}|X - \mu|^4}{\varepsilon^4 n^4}.\]
The quantity on the right-hand side now decays
like~\(O(1/n^2)\), which is summable in~\(n\). Thus we may apply the
Borel--Cantelli lemma and conclude the strong law of large numbers in this
case, provided that~\(X\) has finite fourth moment \(\mathbf{E}|X|^4 < \infty\).
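The fourth moment identity~\(\mathbf{E}|S_n|^4 = 3n(n-1)\sigma^4 + n\mathbf{E}|X|^4\) is easy to test numerically; a minimal sketch in Python (random signs~\(X_i \in \{-1,+1\}\), for which~\(\sigma^2 = \mathbf{E}|X|^4 = 1\) and the identity predicts~\(3n^2 - 2n\), are our choice):

```python
import random

random.seed(3)

def fourth_moment(n, trials, rng=random):
    # Monte Carlo estimate of E|S_n|^4 for S_n a sum of n random signs.
    total = 0.0
    for _ in range(trials):
        s = sum(rng.choice((-1, 1)) for _ in range(n))
        total += s ** 4
    return total / trials

n = 10
# Mean-zero X with sigma^2 = E|X|^4 = 1: the identity gives
# E|S_n|^4 = 3n(n-1) + n = 3n^2 - 2n, which is 280 for n = 10.
print(fourth_moment(n, 200000), 3 * n * n - 2 * n)
```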
\par\null
One can of course continue to compute higher and higher moments
of~\(S_n\) with suitable finite moment hypotheses
on~~\(X\), though as one can already see from the fourth
moment calculation, the computations become increasingly combinatorial
in nature. We will come back to this sort of analysis when we study the
central limit theorem.
\par\null
\subsection*{Applications of the moment
methods}
{\label{213063}}
We begin with two quick applications of the weak law of large numbers to
topics outside of probability. We will first present an explicit version
of the Weierstrass~approximation theorem.
Recall that~
\begin{itemize}
\tightlist
\item
  When a random variable~\(X\) can be written as a finite
  linear combination~\(X = a_1 1_{E_1} + \cdots + a_n 1_{E_n}\) of indicator random variables
  of mutually disjoint events~\(E_1, \dots ,E_n\), its
  power~\(X^r\) has the particularly simple
  form~\[X^r = a_1^r 1_{E_1} + \cdots + a_n^r 1_{E_n}.\]Hence the
  moments are~\[\mathbf{E}X^r = a_1^r p_1 + \cdots + a_n^r p_n,\]where~\(p_i\) is the probability
  of the event~\(E_i\).~
\item
  The elementary identity~\[(1 - p)^2 p + p^2(1-p) = p(1 - p).\]~
\item
  If~\(X\) is a Bernoulli random variable with
  probability~\(p\), then~\(X = 1_{X = 1}\) and the mean
  is the probability itself:~\(\mathbf{E}X = p\).
  Further,~\(|X - p|^2 = (1-p)^2 1_{X = 1} + p^2 1_{X = 0}\), so by the preceding identity the
  variance is~\(\mathbf{E}|X - p|^2 = p(1-p)\).
\item
  Let~\(X_1, X_2, \dots\) be iid Bernoulli variables with
  probability~\(p\). The empirical
  mean~\(S_n/n\) is a random variable taking the
  values~\(\frac{i}{n}, i=0, \dots ,n\), with probability~\[\mathbf{P}(\frac{X_1 + \cdots + X_n}{n} = \frac{i}{n}) = \binom{n}{i}p^i(1-p)^{n-i};\]for
  instance~\(\mathbf{P}(X_1 = 1 \wedge \cdots \wedge X_i = 1 \wedge X_{i+1} = 0 \wedge \cdots \wedge X_n = 0) = p^i(1-p)^{n-i}\), and similarly for permutations.
\item
  Monotonicity of probability: the inclusion of
  events~\(E \subset F\) implies~\(\mathbf{P}(E) \leq \mathbf{P}(F)\).
\item
  Given a random variable~\(X\) and two disjoint
  events~\(E, F\) with~\(\mathbf{P}(E \vee F) = 1\),
  if~\(|X| \leq M_1\) on~\(E\) and~\(|X| \leq M_2\) on~\(F\), then one can bound the
  mean~\(\mathbf{E}|X|\) by~\[\mathbf{E}|X| \leq p M_1 + (1-p) M_2,\]
  where~\(p, 1-p \) are respectively the probabilities
  of~\(E, F\). This generalizes easily to an arbitrary
  finite number of mutually disjoint events.
\item
  A sequence~\(f_1, f_2, \dots\) of functions converges uniformly to a
  function~\(f\) if the uniform distances~\(\sup_x |f_n(x) - f(x)|\)
  tend to zero; by contrast, it converges pointwise to~\(f\) if, for each
  individual point~\(p\), the numerical sequence~\(f_1(p), f_2(p), \dots\)
  converges to~\(f(p)\).
\end{itemize}
\textbf{Proposition 5 (Approximation by Bernstein polynomials)}
Let~\(f: [0, 1] \rightarrow \mathbf{R} \) be a continuous function. Then the Bernstein
polynomials~\[f_n(t) := \sum_{i=0}^{n} \binom{n}{i}t^i(1-t)^{n-i}f(\frac{i}{n})\]converge uniformly
to~\(f\) as~\(n \rightarrow \infty\). In particular,
continuous functions on (say) the unit interval~\([0, 1]\) can
be uniformly approximated by polynomials.
\emph{Proof:}~We first establish the pointwise
convergence~\(f_n(p) \rightarrow f(p)\) for~\(0 \leq p \leq 1\).
Fix~\(p \in [0, 1]\) and let~\(X_1, X_2, \dots\) be iid Bernoulli
variables with probability~\(p\). The empirical
mean~\(S_n/n\) takes values in the finite
set~\(\left\{0, 1/n, \dots, n/n \right\}\) (note that \(f(S_n/n)\) is thus
dominated) with probability~\(\binom{n}{i} p^i (1-p)^{n-i}\) for
each~\(i/n\), and so from the definition of the Bernstein
polynomials~\(f_n\) we see that~
\[\mathbf{E}f(\frac{X_1 + \cdots + X_n}{n}) = f_n(p).\]The mean of the~\(X_i\)
is~\(p\) and the variance is finite, so by the weak law of
large numbers for random variables of finite second moment, we see that
the empirical mean~\(S_n/n\) converges to the
probability~\(p\) in probability. Since~\(f\) is
continuous,~\(f(S_n/n) \rightarrow f(p)\) in probability as well, and by the
dominated convergence theorem in probability we conclude
that~\(\mathbf{E}f(S_n/n) = f_n(p)\) converges to~\(f(p)\), establishing
pointwise convergence.
To establish uniform~convergence, we use the proof of the weak law of
large numbers rather than the statement of that law, to get the
desired~uniformity in the~parameter~\(p\). For a
given~\(p\), we see from Chebyshev's
inequality that~\[\mathbf{P}(|\frac{S_n}{n} - p| \geq \delta) \leq \frac{1}{n}\frac{p(1-p)}{\delta^2}\]for any~\(\delta > 0\). Being
continuous on the compact interval~\([0, 1]\),~\(f\) is uniformly
continuous, and so for any~\(\varepsilon > 0\) there exists
a~\(\delta > 0\) such that~~\(|f(x) - f(y)| < \varepsilon\)
whenever~\(x, y \in [0, 1]\) with~\(|x - y| < \delta\). For such
an~\(\varepsilon\) and~\(\delta\), we conclude (by the
monotonicity of probability) that\[\mathbf{P}(|f(\frac{S_n}{n}) - f(p)| \geq \varepsilon) \leq \frac{1}{n}\frac{p(1-p)}{\delta^2}.\]Also, being
continuous on a compact interval,~\(f\) is bounded in magnitude by some
bound~\(M\), so that~\(|f(S_n/n) - f(p)| \leq 2M\). This leads to
the upper bound~\[\mathbf{E}|f(\frac{S_n}{n}) - f(p)| \leq \varepsilon +
\frac{2M}{n}\frac{p(1-p)}{\delta^2}\] and thus by the triangle inequality
and the identity~\(|\mathbf{E}(f(S_n/n) - f(p))| = |f_n(p) - f(p)|\)\[|f_n(p) - f(p)| \leq \varepsilon +
\frac{2M}{n}\frac{p(1-p)}{\delta^2}.\]
Since~\(p(1-p)\) is clearly bounded and~\(\varepsilon\) can
be made arbitrarily small, we conclude that~\(f_n\) converge
uniformly to~\(f\), as required.~
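The Bernstein polynomials of Proposition 5 can be computed directly and their uniform convergence observed; a minimal sketch in Python (the test function~\(f(x) = |x - 1/2|\) is our choice):

```python
from math import comb

def bernstein(f, n, t):
    # Bernstein polynomial f_n(t) = sum_i C(n,i) t^i (1-t)^(n-i) f(i/n).
    return sum(comb(n, i) * t**i * (1 - t)**(n - i) * f(i / n)
               for i in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [k / 200 for k in range(201)]
for n in (10, 40, 160):
    sup_err = max(abs(bernstein(f, n, t) - f(t)) for t in grid)
    print(n, sup_err)  # the sup-norm error shrinks as n grows
```

The error is largest near the kink at~\(t = 1/2\), where it decays like~\(O(1/\sqrt{n})\), consistent with the variance bound used in the proof.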
\par\null
The first and second moment methods are very general, and apply to
sums~\(S_n = X_1 + \cdots + X_n\) of random variables~\(X_1, \dots, X_n\) that do
not need to be identically distributed, or even independent (although
the bounds can get weaker and more complicated if one deviates too far
from these hypotheses). For instance, it is clear from linearity of
expectation that~\(S_n\) has mean~\[\mathbf{E}S_n = \mathbf{E}X_1 + \cdots + \mathbf{E}X_n\]
(assuming of course that~\(X_1, \dots, X_n\) are absolutely integrable)
and variance~\[\mathbf{Var}(S_n) = \mathbf{Var}(X_1) + \cdots +\mathbf{Var}(X_n) + \sum_{1\leq i,j \leq n; i\ne j} \mathrm{Cov}(X_i, X_j)\](assuming now that~\(X_1, \dots , X_n\)
are square-integrable). If the~\(X_1, \dots, X_n\) are pairwise
independent in addition to being square-integrable, then all the
covariances vanish and we obtain additivity of the
variance:\[\mathbf{Var}(S_n) = \mathbf{Var}(X_1) + \cdots + \mathbf{Var}(X_n).\]
Remark:
\begin{itemize}
\tightlist
\item
Viewing the variance as the square of the standard deviation, the
above additive identity of the variance can be seen as a rigorous
demonstration of the following heuristic \emph{principle of square
root cancellation}: if one has a sum~\(S_n = X_1 + \cdots + X_n\) of random (or
pseudorandom) variables that ``oscillate'' (in the sense that their
mean is either zero or close to zero) and each~\(X_i\) has
an expected magnitude of about~\(a_i\)~(in the sense that
a statistic such as the standard deviation of~\(X_i\) is
comparable to~\(a_i\)), and the~\(X_i\)
``behave independently'', then the sum~\(S_n\) is expected
to have a magnitude of about~\((\Sigma_{i=1}^n |a_i|^2)^{1/2}\). For instance, a sum
of~\(n\) unbiased signs~\(X_i \in \left\{ -1, +1\right\}\) would be
expected to have magnitude about~\(\sqrt n\) unless
the~\(X_i\) exhibit strong mutual correlations. This
  principle turns out to be applicable to a remarkably broad range of
  situations, even ones in which no randomness is evidently present (e.g.\
  the exponential sums that occur in analytic
  number theory). We will see further instantiations of this
  principle in later notes.
\end{itemize}
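The square root cancellation heuristic is easy to observe for a sum of unbiased random signs; a minimal sketch in Python (the parameters are our choice):

```python
import random, statistics

random.seed(4)

def signed_sum(n, rng=random):
    # Sum of n unbiased random signs: mean zero, standard deviation sqrt(n).
    return sum(rng.choice((-1, 1)) for _ in range(n))

n, trials = 2500, 2000
magnitudes = [abs(signed_sum(n)) for _ in range(trials)]
# Typical magnitude is comparable to sqrt(n) = 50.
print(statistics.mean(magnitudes), n ** 0.5)
```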
\par\null
\selectlanguage{english}
\FloatBarrier
\end{document}