
Probability Theory
Richard F. Bass


These notes are © 1998 by Richard F. Bass. They may be used for personal or classroom purposes,
but not for commercial purposes.


Revised 2001.


1. Basic notions.


A probability or probability measure is a measure whose total mass is one. Because the origins of
probability are in statistics rather than analysis, some of the terminology is different. For example, instead
of denoting a measure space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called a σ-field
(which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of F are called events.
Elements of Ω are denoted ω.


Instead of saying a property occurs almost everywhere, we talk about properties occurring almost
surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are
usually denoted by X or Y or other capital letters. We often abbreviate “random variable” by r.v.


We let Ac = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ Ac.


Integration (in the sense of Lebesgue) is called expectation or expected value, and we write EX for
∫ X dP. The notation E[X; A] is often used for ∫_A X dP.


The random variable 1A is the function that is one if ω ∈ A and zero otherwise. It is called the
indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such
as (ω : X(ω) > a) are almost always abbreviated by (X > a).


Given a random variable X, we can define a probability on R by

PX(A) = P(X ∈ A), A ⊂ R. (1.1)


The probability PX is called the law of X or the distribution of X. We define FX : R → [0, 1] by

FX(x) = PX((−∞, x]) = P(X ≤ x). (1.2)

The function FX is called the distribution function of X.


As an example, let Ω = {H, T}, F all subsets of Ω (there are 4 of them), P(H) = P(T) = 1/2. Let
X(H) = 1 and X(T) = 0. Then PX = (1/2)δ0 + (1/2)δ1, where δx is point mass at x, that is, δx(A) = 1 if x ∈ A
and 0 otherwise. FX(a) = 0 if a < 0, 1/2 if 0 ≤ a < 1, and 1 if a ≥ 1.
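The coin-flip example can be checked mechanically. The following is a small sketch (not part of the text) that writes the law PX and the distribution function FX as Python functions and verifies the three cases of FX.

```python
# Hypothetical sketch of the coin-flip example: Omega = {H, T},
# X(H) = 1, X(T) = 0, so P_X = (1/2) delta_0 + (1/2) delta_1.

def law(A):
    """P_X(A) = (1/2) delta_0(A) + (1/2) delta_1(A) for a set A of reals."""
    return 0.5 * (0 in A) + 0.5 * (1 in A)

def F(a):
    """F_X(a) = P(X <= a): 0 for a < 0, 1/2 for 0 <= a < 1, 1 for a >= 1."""
    return law({x for x in (0, 1) if x <= a})

assert F(-0.5) == 0.0
assert F(0.5) == 0.5
assert F(2.0) == 1.0
```

The names `law` and `F` are illustrative only; the point is that FX is piecewise constant with jumps exactly where PX puts point mass.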


Proposition 1.1. The distribution function FX of a random variable X satisfies:
(a) FX is nondecreasing;
(b) FX is right continuous with left limits;
(c) lim_{x→∞} FX(x) = 1 and lim_{x→−∞} FX(x) = 0.


Proof. We prove the first part of (b) and leave the others to the reader. If xn ↓ x, then (X ≤ xn) ↓ (X ≤ x),
and so P(X ≤ xn) ↓ P(X ≤ x) since P is a measure.

Note that if xn ↑ x, then (X ≤ xn) ↑ (X < x), and so FX(xn) ↑ P(X < x).



Proposition 1.2. Suppose F is a distribution function. There exists a random variable X such that
F = FX.


Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define X(ω) = sup{x : F(x) < ω}. It
is routine to check that FX = F.

In the above proof, essentially X = F^{−1}. However F may have jumps or be constant over some
intervals, so some care is needed in defining X.


Certain distributions or laws are very common. We list some of them.


(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X = 0) = 1 − p for some p ∈ [0, 1].
(b) Binomial. This is defined by P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where n is a positive integer, 0 ≤ k ≤ n, and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 − p)p^k. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e^{−λ}λ^k/k!. Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤ k ≤ n.



If F is absolutely continuous, we call f = F′ the density of F. Some examples of distributions
characterized by densities are the following.


(f) Uniform on [a, b]. Define f(x) = (b − a)^{−1} 1_{[a,b]}(x). This means that if X has a uniform distribution,
then

P(X ∈ A) = ∫_A (1/(b − a)) 1_{[a,b]}(x) dx.

(g) Exponential. For x > 0 let f(x) = λe^{−λx}.


(h) Standard normal. Define f(x) = (1/√(2π)) e^{−x²/2}. So

P(X ∈ A) = (1/√(2π)) ∫_A e^{−x²/2} dx.


(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a
standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an
exercise in calculus to check that such a random variable has density

(1/(√(2π)σ)) e^{−(x−µ)²/2σ²}. (1.3)


(j) Cauchy. Here

f(x) = (1/π) · 1/(1 + x²).


We can use the law of a random variable to calculate expectations.

Proposition 1.3. If g is bounded or nonnegative, then

Eg(X) = ∫ g(x) PX(dx).

Proof. If g is the indicator of an event A, this is just the definition of PX. By linearity, the result holds for
simple functions; by monotone convergence it holds for nonnegative g, and by linearity again it holds for
bounded g.


If FX has a density f, then PX(dx) = f(x) dx. So, for example, EX = ∫ x f(x) dx and EX² =
∫ x² f(x) dx. (We need E|X| finite to justify this if X is not necessarily nonnegative.)


We define the mean of a random variable to be its expectation, and the variance of a random variable
is defined by

Var X = E(X − EX)².


For example, it is routine to see that the mean of a standard normal is zero and its variance is one.
Note

Var X = E(X² − 2X·EX + (EX)²) = EX² − (EX)².
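The identity Var X = EX² − (EX)² can be checked on any small empirical distribution; the following sketch (not from the text) uses four equally weighted values.

```python
# Compare the two variance formulas on an empirical distribution
# placing weight 1/4 on each value.
xs = [1.0, 2.0, 2.0, 5.0]
mean = sum(xs) / len(xs)                                   # EX = 2.5
var_def = sum((x - mean) ** 2 for x in xs) / len(xs)       # E(X - EX)^2
var_alt = sum(x * x for x in xs) / len(xs) - mean ** 2     # EX^2 - (EX)^2

assert abs(var_def - var_alt) < 1e-12
```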


Another equality that is useful is the following.

Proposition 1.4. If X ≥ 0 a.s. and p > 0, then

EX^p = ∫_0^∞ p λ^{p−1} P(X > λ) dλ.

The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).
Proof. Use Fubini's theorem and write

∫_0^∞ p λ^{p−1} P(X > λ) dλ = E ∫_0^∞ p λ^{p−1} 1_{(λ,∞)}(X) dλ = E ∫_0^X p λ^{p−1} dλ = EX^p.
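The tail formula in Proposition 1.4 can be verified numerically. The sketch below (not from the text) takes X uniform on {1, 2, 3} and p = 2, so EX² = 14/3, and approximates the integral by a midpoint Riemann sum; the integrand vanishes for λ ≥ 3.

```python
# Verify E X^p = integral_0^inf p * lam^{p-1} * P(X > lam) d(lam)
# for X uniform on {1, 2, 3} and p = 2.
values = [1, 2, 3]
p = 2

def tail(lam):                       # P(X > lam)
    return sum(1 for v in values if v > lam) / len(values)

moment = sum(v ** p for v in values) / len(values)   # E X^2 = 14/3

N = 200000
h = 3.0 / N                          # tail(lam) = 0 for lam >= 3
integral = sum(p * ((i + 0.5) * h) ** (p - 1) * tail((i + 0.5) * h) * h
               for i in range(N))

assert abs(moment - 14 / 3) < 1e-12
assert abs(integral - moment) < 1e-3
```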


We need two elementary inequalities.


Proposition 1.5. (Chebyshev's inequality) If X ≥ 0,

P(X ≥ a) ≤ EX/a.


Proof. We write

P(X ≥ a) = E[1_{[a,∞)}(X)] ≤ E[(X/a) 1_{[a,∞)}(X)] ≤ EX/a,

since X/a is at least 1 when X ∈ [a, ∞).


If we apply this to X = (Y − EY)², we obtain

P(|Y − EY| ≥ a) = P((Y − EY)² ≥ a²) ≤ Var Y/a². (1.4)

This special case of Chebyshev's inequality is sometimes itself referred to as Chebyshev's inequality, while
Proposition 1.5 is sometimes called the Markov inequality.
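A Monte Carlo sanity check of Proposition 1.5 (a sketch, not from the text): for X exponential with rate 1 we have EX = 1, so the empirical tail frequency at level a must sit below 1/a.

```python
import random

# Empirical check of P(X >= a) <= EX / a for X ~ Exponential(1).
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100000)]
a = 3.0
empirical = sum(1 for x in samples if x >= a) / len(samples)

# True value is e^{-3} ~ 0.0498, well below the Markov bound 1/3.
assert empirical <= 1.0 / a
```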



Proposition 1.6. (Jensen's inequality) Suppose g is convex and X and g(X) are both integrable. Then

g(EX) ≤ Eg(X).


Proof. One property of convex functions is that they lie above their tangent lines, and more generally their
support lines. So if x0 ∈ R, we have

g(x) ≥ g(x0) + c(x − x0)

for some constant c. Take x = X(ω) and take expectations to obtain

Eg(X) ≥ g(x0) + c(EX − x0).

Now set x0 equal to EX.


If An is a sequence of sets, define (An i.o.), read “An infinitely often,” by

(An i.o.) = ∩_{n=1}^∞ ∪_{i=n}^∞ Ai.

This set consists of those ω that are in infinitely many of the An.


A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove
the first part here, leaving the second part to the next section.


Proposition 1.7. (Borel-Cantelli lemma) If Σ_n P(An) < ∞, then P(An i.o.) = 0.



Proof. We have

P(An i.o.) = lim_{n→∞} P(∪_{i=n}^∞ Ai).

However,

P(∪_{i=n}^∞ Ai) ≤ Σ_{i=n}^∞ P(Ai),

which tends to zero as n → ∞.
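The lemma can be illustrated by simulation (a sketch, not from the text): with independent events An of probability 2^{−n}, the probability sum converges, so along every simulated sample path only events with small index ever occur.

```python
import random

# Simulate 1000 sample paths of independent events A_n with P(A_n) = 2^{-n}.
random.seed(1)
max_last = 0            # largest index n for which A_n occurred, over all paths
for _ in range(1000):
    occurred = [n for n in range(1, 60) if random.random() < 2.0 ** -n]
    if occurred:
        max_last = max(max_last, max(occurred))

# P(any A_n with n >= 30 occurs in any of the 1000 paths) <= 1000 * 2^{-29}.
assert max_last < 30
```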


2. Independence.


Let us say two events A and B are independent if P(A ∩ B) = P(A)P(B). The events A1, . . . , An are
independent if

P(A_{i1} ∩ A_{i2} ∩ · · · ∩ A_{ij}) = P(A_{i1})P(A_{i2}) · · · P(A_{ij})

for every subset {i1, . . . , ij} of {1, 2, . . . , n}.


Proposition 2.1. If A and B are independent, then Ac and B are independent.


Proof. We write

P(Ac ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B) = P(B)(1 − P(A)) = P(B)P(Ac).



We say two σ-fields F and G are independent if A and B are independent whenever A ∈ F and B ∈ G. Two
random variables X and Y are independent if σ(X) and σ(Y) are independent. (The σ-field σ(X) generated
by X is the collection {(X ∈ A) : A a Borel subset of R}.) We define the independence of n σ-fields or n
random variables in the obvious way.


Proposition 2.1 tells us that A and B are independent if the random variables 1A and 1B are
independent, so the definitions above are consistent.


If f and g are Borel functions and X and Y are independent, then f(X) and g(Y) are independent.
This follows because the σ-field generated by f(X) is a sub-σ-field of the one generated by X, and similarly
for g(Y).


Let FX,Y(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y. (The comma
inside the set means “and.”)


Proposition 2.2. FX,Y(x, y) = FX(x)FY(y) if and only if X and Y are independent.


Proof. If X and Y are independent, then 1_{(−∞,x]}(X) and 1_{(−∞,y]}(Y) are independent by the above
comments. By the definition of independence, this shows FX,Y(x, y) = FX(x)FY(y).


Conversely, if the equality holds, fix y and let My denote the collection of sets A for which P(X ∈
A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). My contains all sets of the form (−∞, x]. It follows by additivity that My
contains all sets of the form (x, z], and then, by additivity again, all sets that are finite unions of such
half-open, half-closed intervals. Note that the collection of finite unions of such intervals, A, is an algebra
generating the Borel σ-field. It is clear that My is a monotone class, so by the monotone class lemma, My
contains the Borel σ-field.


For a fixed set A, let MA denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈
A)P(Y ∈ B). Again, MA is a monotone class, and by the preceding paragraph it contains the σ-field generated
by the collection of finite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X
and Y are independent.


The following is known as the multiplication theorem.


Proposition 2.3. If X, Y, and XY are integrable and X and Y are independent, then EXY = EXEY.
Proof. Consider the random variables in σ(X) (the σ-field generated by X) and σ(Y) for which the
multiplication theorem is true. It holds for indicators by the definition of X and Y being independent.
It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides.
It holds for nonnegative random variables by monotone convergence. And it holds for integrable random
variables by linearity again.


Let us give an example of independent random variables. Let Ω = Ω1 × Ω2 and let P = P1 × P2, where
(Ωi, Fi, Pi) are probability spaces for i = 1, 2. We use the product σ-field. Then it is clear that F1 and F2
are independent by the definition of P. If X1 is a random variable such that X1(ω1, ω2) depends only on ω1
and X2 depends only on ω2, then X1 and X2 are independent.



This example can be extended to n independent random variables, and in fact, if one has independent
random variables, one can always view them as coming from a product space. We will not use this fact.
Later on, we will talk about countable sequences of independent r.v.s, and the reader may wonder whether
such things can exist. That they can is a consequence of the Kolmogorov extension theorem; see PTA, for
example.


If X1, . . . , Xn are independent, then so are X1 − EX1, . . . , Xn − EXn. Assuming everything is
integrable, we can expand

Var(X1 + · · · + Xn) = E[(X1 − EX1) + · · · + (Xn − EXn)]²,



using the multiplication theorem to show that the expectations of the cross product terms are zero. We have
thus shown


Var (X1+· · ·+Xn) = VarX1+· · ·+ VarXn. (2.1)


We finish up this section by proving the second half of the Borel-Cantelli lemma.

Proposition 2.4. Suppose An is a sequence of independent events. If Σ_n P(An) = ∞, then P(An i.o.) = 1.


Note that here the An are independent, while in the first half of the Borel-Cantelli lemma no such
assumption was necessary.

Proof. Note

P(∪_{i=n}^N Ai) = 1 − P(∩_{i=n}^N Ai^c) = 1 − Π_{i=n}^N P(Ai^c) = 1 − Π_{i=n}^N (1 − P(Ai)).

By the mean value theorem, 1 − x ≤ e^{−x}, so we have that the right hand side is greater than or equal to
1 − exp(−Σ_{i=n}^N P(Ai)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ Ai) = 1. This holds for all n, which proves
the result.
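A simulation sketch of the second half (not from the text): independent events with P(An) = 1/n have a divergent probability sum, and occurrences keep accumulating — up to time N one expects about log N of them.

```python
import random

# Independent events A_n with P(A_n) = 1/n, n = 1, ..., N.
random.seed(2)
N = 100000
count = sum(1 for n in range(1, N + 1) if random.random() < 1.0 / n)

# E[count] = sum_{n<=N} 1/n ~ ln N ~ 11.5; occurrences do not die out.
assert count >= 3
```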


3. Convergence.


In this section we consider three ways a sequence of r.v.s Xn can converge.

We say Xn converges to X almost surely if (Xn ↛ X) has probability zero. Xn converges to X in
probability if for each ε > 0, P(|Xn − X| > ε) → 0 as n → ∞. Xn converges to X in L^p if E|Xn − X|^p → 0 as
n → ∞.


The following proposition shows some relationships among the types of convergence.

Proposition 3.1. (a) If Xn → X a.s., then Xn → X in probability.
(b) If Xn → X in L^p, then Xn → X in probability.
(c) If Xn → X in probability, there exists a subsequence nj such that X_{nj} converges to X almost surely.


Proof. To prove (a), note Xn − X tends to 0 almost surely, so 1_{(−ε,ε)^c}(Xn − X) also converges to 0 almost
surely. Now apply the dominated convergence theorem.

(b) comes from Chebyshev's inequality:

P(|Xn − X| > ε) = P(|Xn − X|^p > ε^p) ≤ E|Xn − X|^p/ε^p → 0

as n → ∞.


To prove (c), choose nj larger than n_{j−1} such that P(|Xn − X| > 2^{−j}) < 2^{−j} whenever n ≥ nj.
So if we let Ai = (|X_{nj} − X| > 2^{−i} for some j ≥ i), then P(Ai) ≤ 2^{−i+1}. By the Borel-Cantelli lemma
P(Ai i.o.) = 0. This implies X_{nj} → X on the complement of (Ai i.o.).


Let us give some examples to show there need not be any other implications among the three types
of convergence.

Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let Xn = e^n 1_{(0,1/n)}. Then clearly Xn
converges to 0 almost surely and in probability, but EXn^p = e^{np}/n → ∞ for any p.


Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass
1. Let tn = Σ_{i=1}^n i^{−1}, and let An = {θ : t_{n−1} ≤ θ < tn} (taken modulo 2π). Let Xn = 1_{An}. Since tn → ∞, any
point on the unit circle will be in infinitely many An, so Xn does not converge almost surely to 0. But
P(An) = 1/(2πn) → 0, so Xn → 0 in probability.



4. Weak law of large numbers.


Suppose Xn is a sequence of independent random variables. Suppose also that they all have the same
distribution, that is, F_{Xn} = F_{X1} for all n. This situation comes up so often it has a name, independent,
identically distributed, which is abbreviated i.i.d.

Define Sn = Σ_{i=1}^n Xi. Sn is called a partial sum process. Sn/n is the average value of the first n of
the Xi's.


Theorem 4.1. (Weak law of large numbers) Suppose the Xi are i.i.d. and EX1² < ∞. Then Sn/n → EX1
in probability.

Proof. Since the Xi are i.i.d., they all have the same expectation, and so ESn = nEX1. Hence E(Sn/n −
EX1)² is the variance of Sn/n. If ε > 0, by Chebyshev's inequality,

P(|Sn/n − EX1| > ε) ≤ Var(Sn/n)/ε² = (Σ_{i=1}^n Var Xi)/(n²ε²) = n Var X1/(n²ε²). (4.1)

Since EX1² < ∞, then Var X1 < ∞, and the result follows by letting n → ∞.
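A Monte Carlo illustration of the theorem (a sketch, not from the text): running averages of i.i.d. Uniform[0,1] variables concentrate near the mean 1/2, and the Chebyshev bound (4.1) with Var X1 = 1/12 makes deviations of size ε = 0.05 at n = 10000 essentially impossible.

```python
import random

# Check how often |S_n/n - 1/2| > eps across independent trials.
random.seed(3)
n, eps, trials = 10000, 0.05, 200
bad = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    if abs(s / n - 0.5) > eps:
        bad += 1

# Bound (4.1): Var X_1 / (n * eps^2) = (1/12) / 25 ~ 0.0033 per trial;
# the true deviation probability is far smaller still.
assert bad == 0
```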


A nice application of the weak law of large numbers is a proof of the Weierstrass approximation
theorem.


Theorem 4.2. Suppose f is a continuous function on [0, 1] and ε > 0. There exists a polynomial P such
that sup_{x∈[0,1]} |f(x) − P(x)| < ε.


Proof. Let

P(x) = Σ_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}.

Clearly P is a polynomial. Since f is continuous, there exists M such that |f(x)| ≤ M for all x and there
exists δ such that |f(x) − f(y)| < ε/2 whenever |x − y| < δ.


Let Xi be i.i.d. Bernoulli r.v.s with parameter x. Then Sn, the partial sum, is a binomial, and hence
P(x) = Ef(Sn/n). The mean of Sn/n is x. We have

|P(x) − f(x)| = |Ef(Sn/n) − f(EX1)| ≤ E|f(Sn/n) − f(EX1)|
≤ M P(|Sn/n − x| > δ) + ε/2.

By (4.1) the first term will be less than

M Var X1/(nδ²) = M x(1 − x)/(nδ²) ≤ M/(4nδ²),

which will be less than ε/2 if n is large enough, uniformly in x.
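The polynomial in the proof is the Bernstein polynomial of f. The sketch below (not from the text) builds it for the hypothetical target f(x) = |x − 1/2| and checks on a grid that the sup-distance shrinks as n grows.

```python
import math

# Bernstein polynomial of degree n for f, evaluated at x.
def bernstein(f, n, x):
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [i / 100 for i in range(101)]
err = lambda n: max(abs(f(x) - bernstein(f, n, x)) for x in grid)

assert err(200) < err(10)       # approximation improves with n
assert err(200) < 0.05
```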



5. Techniques related to almost sure convergence.


Our aim is the strong law of large numbers (SLLN), which says that Sn/n converges to EX1 almost
surely if E|X1| < ∞.



Proposition 5.1. Suppose Xi is an i.i.d. sequence with EXi⁴ < ∞ and let Sn = Σ_{i=1}^n Xi. Then Sn/n →
EX1 a.s.

Proof. By looking at Xi − EXi we may assume that the Xi have mean 0. By Chebyshev,

P(|Sn/n| > ε) ≤ E(Sn/n)⁴/ε⁴ = ESn⁴/(n⁴ε⁴).


If we expand Sn⁴, we will have terms involving Xi⁴, terms involving Xi²Xj², terms involving Xi³Xj, terms
involving Xi²XjXk, and terms involving XiXjXkXl, with i, j, k, l all being different. By the multiplication
theorem and the fact that the Xi have mean 0, the expectations of all the terms will be 0 except for those
of the first two types. So


ESn⁴ = Σ_{i=1}^n EXi⁴ + 3 Σ_{i≠j} EXi² EXj².


By the finiteness assumption, the first term on the right is bounded by c1·n. By Cauchy-Schwarz, EXi² ≤
(EXi⁴)^{1/2} < ∞, and there are at most n² terms in the second sum on the right, so the second term is
bounded by c2·n². Substituting, we have

P(|Sn/n| > ε) ≤ c3/(n²ε⁴).

Consequently P(|Sn/n| > ε i.o.) = 0 by Borel-Cantelli. Since ε is arbitrary, this implies Sn/n → 0 a.s.



Before we can prove the SLLN assuming only the finiteness of first moments, we need some
preliminaries.

Proposition 5.2. If Y ≥ 0, then EY < ∞ if and only if Σ_n P(Y > n) < ∞.

Proof. By Proposition 1.4, EY = ∫_0^∞ P(Y > x) dx. P(Y > x) is nonincreasing in x, so the integral is
bounded above by Σ_{n=0}^∞ P(Y > n) and bounded below by Σ_{n=1}^∞ P(Y > n).
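The sandwich in Proposition 5.2 can be checked exactly for a geometric variable (a sketch, not from the text): with P(Y = k) = (1 − p)p^k on k = 0, 1, 2, . . . , one has EY = p/(1 − p) and P(Y > n) = p^{n+1}, and for integer-valued Y the upper sum equals EY.

```python
# Compare EY with the two tail sums from Proposition 5.2.
p = 0.6
EY = p / (1 - p)                                   # = 1.5
upper = sum(p ** (n + 1) for n in range(0, 500))   # sum_{n>=0} P(Y > n)
lower = sum(p ** (n + 1) for n in range(1, 500))   # sum_{n>=1} P(Y > n)

assert lower <= EY <= upper + 1e-9
assert abs(upper - EY) < 1e-9   # exact equality for integer-valued Y
```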


If Xi is a sequence of r.v.s, the tail σ-field is defined by ∩_{n=1}^∞ σ(Xn, X_{n+1}, . . .). An example of an
event in the tail σ-field is (lim sup_{n→∞} Xn > a). Another example is (lim sup_{n→∞} Sn/n > a). The reason
for this is that if k < n is fixed,

Sn/n = Sk/n + (Σ_{i=k+1}^n Xi)/n.

The first term on the right tends to 0 as n → ∞. So lim sup Sn/n = lim sup (Σ_{i=k+1}^n Xi)/n, which is in
σ(X_{k+1}, X_{k+2}, . . .). This holds for each k. The set (lim sup Sn > a) is easily seen not to be in the tail σ-field.


Theorem 5.3. (Kolmogorov 0-1 law) If the Xi are independent, then the events in the tail σ-field have
probability 0 or 1.

This implies that in the case of i.i.d. random variables, if Sn/n has a limit with positive probability,
then it has a limit with probability one, and the limit must be a constant.

Proof. Let M be the collection of sets in σ(X_{n+1}, . . .) that are independent of every set in σ(X1, . . . , Xn).
M is easily seen to be a monotone class and it contains σ(X_{n+1}, . . . , XN) for every N > n. Therefore M
must be equal to σ(X_{n+1}, . . .).

If A is in the tail σ-field, then for each n, A is independent of σ(X1, . . . , Xn). The class MA of sets
independent of A is a monotone class, hence is a σ-field containing σ(X1, . . . , Xn) for each n. Therefore MA
contains σ(X1, X2, . . .), and in particular contains the tail σ-field, hence A itself.

We thus have that the event A is independent of itself, or

P(A) = P(A ∩ A) = P(A)P(A) = P(A)².

This implies P(A) is zero or one.


The next proposition shows that in considering a law of large numbers we can consider truncated
random variables.

Proposition 5.4. Suppose Xi is an i.i.d. sequence of r.v.s with E|X1| < ∞. Let Xn′ = Xn 1_{(|Xn|≤n)}. Then
(a) Xn converges almost surely if and only if Xn′ does;
(b) if Sn′ = Σ_{i=1}^n Xi′, then Sn/n converges a.s. if and only if Sn′/n does.

Proof. Let An = (Xn ≠ Xn′) = (|Xn| > n). Then P(An) = P(|Xn| > n) = P(|X1| > n). Since E|X1| < ∞,
then by Proposition 5.2 we have Σ P(An) < ∞. So by the Borel-Cantelli lemma, P(An i.o.) = 0. Thus for
almost every ω, Xn = Xn′ for n sufficiently large. This proves (a).

For (b), let k (depending on ω) be the largest integer such that Xk′(ω) ≠ Xk(ω). Then Sn/n − Sn′/n =
(X1 + · · · + Xk)/n − (X1′ + · · · + Xk′)/n → 0 as n → ∞.


Next is Kolmogorov’s inequality, a special case of Doob’s inequality.


Proposition 5.5. Suppose the Xi are independent andEXi = 0for eachi. Then


P( max



1≤i≤n|Si| ≥λ)≤


ESn2


λ2 .


Proof. Let Ak = (|Sk| ≥ λ, |S1| < λ, . . . , |S_{k−1}| < λ). Note the Ak are disjoint and that Ak ∈ σ(X1, . . . , Xk).
Therefore Ak is independent of Sn − Sk. Then

ESn² ≥ Σ_{k=1}^n E[Sn²; Ak]
= Σ_{k=1}^n E[(Sk² + 2Sk(Sn − Sk) + (Sn − Sk)²); Ak]
≥ Σ_{k=1}^n E[Sk²; Ak] + 2 Σ_{k=1}^n E[Sk(Sn − Sk); Ak].


Using the independence, E[Sk(Sn − Sk)1_{Ak}] = E[Sk 1_{Ak}] E[Sn − Sk] = 0. Therefore

ESn² ≥ Σ_{k=1}^n E[Sk²; Ak] ≥ Σ_{k=1}^n λ² P(Ak) = λ² P(max_{1≤k≤n} |Sk| ≥ λ).

Our result is immediate from this.
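A Monte Carlo sanity check of the inequality (a sketch, not from the text): for a walk with independent ±1 steps, ESn² = n, so the running maximum obeys P(max_{i≤n} |Si| ≥ λ) ≤ n/λ².

```python
import random

# Estimate P(max |S_i| >= lam) for a +-1 walk and compare with n/lam^2.
random.seed(4)
n, lam, trials = 100, 25, 5000
hits = 0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    if m >= lam:
        hits += 1

assert hits / trials <= n / lam ** 2    # Kolmogorov bound = 100/625 = 0.16
```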


The last result we need for now is a special case of what is known as Kronecker's lemma.

Proposition 5.6. Suppose xi are real numbers and sn = Σ_{i=1}^n xi. If Σ_{j=1}^∞ (xj/j) converges, then sn/n → 0.


Proof. Let bn = Σ_{j=1}^n (xj/j), b0 = 0, and suppose bn → b. As is well known, this implies (Σ_{i=1}^n bi)/n → b.
We have n(bn − b_{n−1}) = xn, so

sn/n = (Σ_{i=1}^n (i bi − i b_{i−1}))/n = (Σ_{i=1}^n i bi − Σ_{i=1}^{n−1} (i + 1) bi)/n = bn − (Σ_{i=1}^{n−1} bi)/n → b − b = 0.
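A deterministic check of Kronecker's lemma (a sketch, not from the text): with xj = cos j, the series Σ xj/j converges by Dirichlet's test, and indeed sn/n → 0 because the partial sums sn stay bounded.

```python
import math

# Partial sums s_n = sum_{j=1}^n cos j; these are bounded uniformly in n.
def s(n):
    return sum(math.cos(j) for j in range(1, n + 1))

# Closed form gives |s_n| <= 1/(2 sin(1/2)) + 1/2 < 2.6 for all n.
assert all(abs(s(n)) < 2.6 for n in (10, 100, 1000))
assert abs(s(20000) / 20000) < 1e-3      # s_n/n -> 0
```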



6. Strong law of large numbers.


This section is devoted to a proof of Kolmogorov's strong law of large numbers. We showed earlier
that if EXi² < ∞, where the Xi are i.i.d., then the weak law of large numbers (WLLN) holds: Sn/n converges
to EX1 in probability. The WLLN can be improved greatly; it is enough that xP(|X1| > x) → 0 as x → ∞.
Here we show the strong law (SLLN): if one has a finite first moment, then there is almost sure convergence.
First we need a lemma.



Lemma 6.1. Suppose Vi is a sequence of independent r.v.s, each with mean 0. Let Wn = Σ_{i=1}^n Vi. If
Σ_{i=1}^∞ Var Vi < ∞, then Wn converges almost surely.


Proof. Choose nj > n_{j−1} such that Σ_{i=nj}^∞ Var Vi < 2^{−3j}. If n > nj, then applying Kolmogorov's inequality
shows that

P(max_{nj≤i≤n} |Wi − W_{nj}| > 2^{−j}) ≤ 2^{−3j}/2^{−2j} = 2^{−j}.

Letting n → ∞, we have P(Aj) ≤ 2^{−j}, where

Aj = (max_{nj≤i} |Wi − W_{nj}| > 2^{−j}).

By the Borel-Cantelli lemma, P(Aj i.o.) = 0.


Suppose ω ∉ (Aj i.o.). Let ε > 0. Choose j large enough so that 2^{−j+1} < ε and ω ∉ Aj. If n, m > nj,
then

|Wn − Wm| ≤ |Wn − W_{nj}| + |Wm − W_{nj}| ≤ 2^{−j+1} < ε.

Since ε is arbitrary, Wn(ω) is a Cauchy sequence, and hence converges.


Theorem 6.2. (SLLN) Let Xi be a sequence of i.i.d. random variables. Then Sn/n converges almost surely
if and only if E|X1| < ∞.

Proof. Let us first suppose Sn/n converges a.s. and show E|X1| < ∞. If Sn(ω)/n → a, then

S_{n−1}/n = (S_{n−1}/(n − 1)) · ((n − 1)/n) → a.

So

Xn/n = Sn/n − S_{n−1}/n → a − a = 0.


Hence Xn/n → 0, a.s. Thus P(|Xn| > n i.o.) = 0. By the second part of Borel-Cantelli, Σ P(|Xn| > n) < ∞.
Since the Xi are i.i.d., this means Σ_{n=1}^∞ P(|X1| > n) < ∞, and by Proposition 5.2, E|X1| < ∞.


Now suppose E|X1| < ∞. By looking at Xi − EXi, we may suppose without loss of generality that
EXi = 0. We truncate, and let Yi = Xi 1_{(|Xi|≤i)}. It suffices to show Σ_{i=1}^n Yi/n → 0 a.s., by Proposition 5.4.

Next we estimate. We have

EYi = E[X1; |X1| ≤ i] → EX1 = 0.


The convergence follows by the dominated convergence theorem, since the integrands are bounded by |X1|.

To estimate the second moment of the Yi, we write

EYi² = ∫_0^∞ 2y P(|Yi| ≥ y) dy = ∫_0^i 2y P(|Yi| ≥ y) dy ≤ ∫_0^i 2y P(|X1| ≥ y) dy,

and so

Σ_{i=1}^∞ E(Yi²/i²) ≤ Σ_{i=1}^∞ (1/i²) ∫_0^i 2y P(|X1| ≥ y) dy
= 2 Σ_{i=1}^∞ (1/i²) ∫_0^∞ 1_{(y≤i)} y P(|X1| ≥ y) dy
= 2 ∫_0^∞ Σ_{i=1}^∞ (1/i²) 1_{(y≤i)} y P(|X1| ≥ y) dy
≤ 4 ∫_0^∞ (1/y) y P(|X1| ≥ y) dy
= 4 ∫_0^∞ P(|X1| ≥ y) dy = 4E|X1| < ∞.


Let Ui = Yi − EYi. Then Var Ui = Var Yi ≤ EYi², and by the above,

Σ_{i=1}^∞ Var(Ui/i) < ∞.

By Lemma 6.1 (with Vi = Ui/i), Σ_{i=1}^n (Ui/i) converges almost surely. By Kronecker's lemma, (Σ_{i=1}^n Ui)/n → 0
almost surely. Finally, since EYi → 0, then Σ_{i=1}^n EYi/n → 0, hence Σ_{i=1}^n Yi/n → 0.
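The dichotomy in Theorem 6.2 can be seen in simulation (a sketch, not from the text): averages of i.i.d. Uniform[0,1] variables settle near EX1 = 1/2, while averages of Cauchy samples, for which E|X1| = ∞, never settle down.

```python
import math
import random

random.seed(5)
n = 200000

# Finite first moment: the average converges.
unif_avg = sum(random.random() for _ in range(n)) / n
assert abs(unif_avg - 0.5) < 0.01

# Cauchy samples via the inverse distribution function tan(pi(U - 1/2)).
s, running = 0.0, []
for i in range(1, n + 1):
    s += math.tan(math.pi * (random.random() - 0.5))
    if i % 50000 == 0:
        running.append(s / i)

# The running averages keep moving instead of converging.
spread = max(running) - min(running)
assert spread > 1e-3
```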


7. Uniform integrability. Before proceeding to some extensions of the SLLN, we discuss uniform
integrability. A sequence of r.v.s is uniformly integrable if

sup_i ∫_{(|Xi|>M)} |Xi| dP → 0

as M → ∞.


Proposition 7.1. Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is nondecreasing, ϕ(x)/x → ∞ as
x → ∞, and sup_i Eϕ(|Xi|) < ∞. Then the Xi are uniformly integrable.

Proof. Let ε > 0 and choose x0 such that x/ϕ(x) < ε if x ≥ x0. If M ≥ x0,

∫_{(|Xi|>M)} |Xi| = ∫ (|Xi|/ϕ(|Xi|)) ϕ(|Xi|) 1_{(|Xi|>M)} ≤ ε ∫ ϕ(|Xi|) ≤ ε sup_i Eϕ(|Xi|).


Proposition 7.2. If Xn and Yn are two uniformly integrable sequences, then Xn + Yn is also a uniformly
integrable sequence.

Proof. Since there exists M0 such that sup_n E[|Xn|; |Xn| > M0] < 1 and sup_n E[|Yn|; |Yn| > M0] < 1,
then sup_n E|Xn| ≤ M0 + 1, and similarly for the Yn. Let ε > 0 and choose M1 > 4(M0 + 1)/ε such that
sup_n E[|Xn|; |Xn| > M1] < ε/4 and sup_n E[|Yn|; |Yn| > M1] < ε/4. Let M2 = 4M1².

Note P(|Xn| + |Yn| > M2) ≤ (E|Xn| + E|Yn|)/M2 ≤ ε/(4M1) by Chebyshev's inequality. Then

E[|Xn + Yn|; |Xn + Yn| > M2] ≤ E[|Xn|; |Xn| > M1]
+ E[|Xn|; |Xn| ≤ M1, |Xn + Yn| > M2]
+ E[|Yn|; |Yn| > M1]
+ E[|Yn|; |Yn| ≤ M1, |Xn + Yn| > M2].

The first and third terms on the right are each less than ε/4 by our choice of M1. The second and fourth
terms are each at most M1 P(|Xn + Yn| > M2) ≤ ε/4, so the whole right-hand side is less than ε.


The main result we need in this section is Vitali's convergence theorem.

Theorem 7.3. If Xn → X almost surely and the Xn are uniformly integrable, then E|Xn − X| → 0.

Proof. By the above proposition, Xn − X is uniformly integrable and tends to 0 a.s., so without loss of
generality, we may assume X = 0. Let ε > 0 and choose M such that sup_n E[|Xn|; |Xn| > M] < ε. Then

E|Xn| ≤ E[|Xn|; |Xn| > M] + E[|Xn|; |Xn| ≤ M] ≤ ε + E[|Xn| 1_{(|Xn|≤M)}].

The second term on the right goes to 0 by dominated convergence.


8. Complements to the SLLN.


Proposition 8.1. Suppose Xi is an i.i.d. sequence and E|X1| < ∞. Then

E|Sn/n − EX1| → 0.


Proof. Without loss of generality we may assume EX1 = 0. By the SLLN, Sn/n → 0 a.s. So we need to
show that the sequence Sn/n is uniformly integrable.

Pick M1 such that E[|X1|; |X1| > M1] < ε/2. Pick M2 = M1 E|X1|/ε. So

P(|Sn/n| > M2) ≤ E|Sn|/(nM2) ≤ E|X1|/M2 = ε/M1.

We used here E|Sn| ≤ Σ_{i=1}^n E|Xi| = nE|X1|.


We then have

E[|Xi|; |Sn/n| > M2] ≤ E[|Xi|; |Xi| > M1] + E[|Xi|; |Xi| ≤ M1, |Sn/n| > M2]
≤ ε + M1 P(|Sn/n| > M2) ≤ 2ε.

Finally,

E[|Sn/n|; |Sn/n| > M2] ≤ (1/n) Σ_{i=1}^n E[|Xi|; |Sn/n| > M2] ≤ 2ε.



Theorem 8.2. Let Xi be a sequence of independent random variables, A > 0, and Yi = Xi 1_{(|Xi|≤A)}. Then
Σ Xi converges if and only if all of the following three series converge: (a) Σ P(|Xn| > A); (b) Σ EYi; (c)
Σ Var Yi.


Proof of “if” part. Since (c) holds, then Σ (Yi − EYi) converges by Lemma 6.1. Since (b) holds, taking the
difference shows Σ Yi converges. Since (a) holds, Σ P(Xi ≠ Yi) = Σ P(|Xi| > A) < ∞, so by Borel-Cantelli,
P(Xi ≠ Yi i.o.) = 0. It follows that Σ Xi converges.
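A classic illustration of Theorem 8.2 (a sketch, not from the text) is the random harmonic series Xn = εn/n with independent fair signs εn = ±1: with A = 1, series (a) is identically zero, (b) is zero, and (c) is Σ 1/n² < ∞, so the partial sums converge almost surely.

```python
import random

# Partial sums of the random harmonic series sum eps_n / n.
random.seed(6)
s, checkpoints = 0.0, {}
for n in range(1, 100001):
    s += random.choice((-1, 1)) / n
    if n in (10000, 50000, 100000):
        checkpoints[n] = s

# Cauchy behaviour: the tail from 50000 to 100000 has standard
# deviation sqrt(sum 1/n^2) ~ 0.003, so late partial sums are close.
assert abs(checkpoints[100000] - checkpoints[50000]) < 0.1
```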


9. Conditional expectation.

If F ⊆ G are two σ-fields and X is an integrable G measurable random variable, the conditional
expectation of X given F, written E[X | F] and read as “the expectation (or expected value) of X given
F,” is any F measurable random variable Y such that E[Y; A] = E[X; A] for every A ∈ F. The conditional
probability of A ∈ G given F is defined by P(A | F) = E[1A | F].

If Y1, Y2 are two F measurable random variables with E[Y1; A] = E[Y2; A] for all A ∈ F, then Y1 = Y2,
a.s.; that is, conditional expectation is unique up to a.s. equivalence.


In the case X is already F measurable, E[X | F] = X. If X is independent of F, E[X | F] = EX.
Both of these facts follow immediately from the definition. For another example, which ties this definition
with the one used in elementary probability courses, if {Ai} is a finite collection of disjoint sets whose union
is Ω, P(Ai) > 0 for all i, and F is the σ-field generated by the Ai's, then

P(A | F) = Σ_i (P(A ∩ Ai)/P(Ai)) 1_{Ai}.

This follows since the right-hand side is F measurable and its expectation over any set Ai is P(A ∩ Ai).
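The partition formula can be computed exactly on a finite sample space. The following sketch (not from the text) takes two fair dice, lets F be generated by the value of the first die, and conditions the event A = {sum equals 7}; the answer is 1/6 on every atom.

```python
from fractions import Fraction as Fr

# Omega = outcomes of two fair dice, each point having probability 1/36.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fr(1, 36) for w in omega}
A = {w for w in omega if w[0] + w[1] == 7}

def cond_prob(w):
    """P(A | F)(w) where F is generated by the first coordinate."""
    cell = [v for v in omega if v[0] == w[0]]    # atom of F containing w
    return sum(P[v] for v in cell if v in A) / sum(P[v] for v in cell)

assert all(cond_prob(w) == Fr(1, 6) for w in omega)
# Taking expectations recovers P(A) = 6/36.
assert sum(cond_prob(w) * P[w] for w in omega) == Fr(1, 6)
```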



As an example, suppose we toss a fair coin independently 5 times and let Xi be 1 or 0 depending
on whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let Fi =
σ(X1, . . . , Xi). Then P(A) = 1/32, while P(A | F1) is equal to 1/16 on the event (X1 = 1) and 0 on the event
(X1 = 0). P(A | F2) is equal to 1/8 on the event (X1 = 1, X2 = 1) and 0 otherwise.


We have

E[E[X | F]] = EX (9.1)

because E[E[X | F]] = E[E[X | F]; Ω] = E[X; Ω] = EX.
The following is easy to establish.


Proposition 9.1. (a) If X ≥ Y are both integrable, then E[X | F] ≥ E[Y | F] a.s.
(b) If X and Y are integrable and a ∈ R, then E[aX + Y | F] = aE[X | F] + E[Y | F].


It is easy to check that limit theorems such as monotone convergence and dominated convergence
have conditional expectation versions, as do inequalities like Jensen’s and Chebyshev’s inequalities. Thus,
for example, we have the following.


Proposition 9.2. (Jensen's inequality for conditional expectations) If g is convex and X and g(X) are
integrable, then

g(E[X | F]) ≤ E[g(X) | F], a.s.

Proposition 9.3. If X and XY are integrable and Y is measurable with respect to F, then

E[XY | F] = Y E[X | F]. (9.2)

Proof. If A ∈ F, then for any B ∈ F,

E[1A E[X | F]; B] = E[E[X | F]; A ∩ B] = E[X; A ∩ B] = E[1A X; B].

Since 1A E[X | F] is F measurable, this shows that (9.2) holds when Y = 1A and A ∈ F. Using linearity
and taking limits shows that (9.2) holds whenever Y is F measurable and X and XY are integrable.
Two other equalities follow.

Proposition 9.4. If E ⊆ F ⊆ G, then

E[E[X | F] | E] = E[X | E] = E[E[X | E] | F].

Proof. The right equality holds because E[X | E] is E measurable, hence F measurable. To show the left
equality, let A ∈ E. Then since A is also in F,

E[E[E[X | F] | E]; A] = E[E[X | F]; A] = E[X; A] = E[E[X | E]; A].

Since both sides are E measurable, the equality follows.


To show the existence of E[X | F], we proceed as follows.

Proposition 9.5. If X is integrable, then E[X | F] exists.

Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q on F by Q(A) = E[X; A] for
A ∈ F. This is trivially absolutely continuous with respect to P|F, the restriction of P to F. Let E[X | F]
be the Radon-Nikodym derivative of Q with respect to P|F. The Radon-Nikodym derivative is F measurable
by construction and so provides the desired random variable.


When F = σ(Y), one usually writes E[X | Y] for E[X | F]. Notation that is commonly used (however,
we will use it only very occasionally and only for heuristic purposes) is E[X | Y = y]. The definition is as
follows. If A ∈ σ(Y), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y), or 1A = 1B(Y). By
linearity and taking limits, if Z is σ(Y) measurable, Z = f(Y) for some Borel measurable function f. Set
Z = E[X | Y] and choose f Borel measurable so that Z = f(Y). Then E[X | Y = y] is defined to be f(y).

If X ∈ L² and M = {Y ∈ L² : Y is F-measurable}, one can show that E[X | F] is equal to the
projection of X onto the subspace M. We will not use this in these notes.


10. Stopping times.


We next want to talk about stopping times. Suppose we have a sequence of σ-fields Fi such that
Fi ⊂ F_{i+1} for each i. An example would be if Fi = σ(X1, . . . , Xi). A random mapping N from Ω to
{0, 1, 2, . . .} is called a stopping time if for each n, (N ≤ n) ∈ Fn. A stopping time is also called an optional
time in the Markov theory literature.


The intuition is that the sequence knows whether N has happened by time n by looking at Fn.


Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the first motorcycle
shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors,
and come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop
at the third stop light before the city limits (and they had not been there before), they would need to drive
to the city limits, then turn around and return past three stop lights. That is not a stopping time, because
they have to go ahead of where they wanted to stop to know to stop there.
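In the same spirit, a first-passage time N = min{n : S_n ≥ b} is a stopping time: whether N ≤ n is determined by S_0, . . . , S_n alone. A minimal sketch (the particular path and level are arbitrary choices for illustration):

```python
# A first-passage time is a stopping time: the event (N <= n) is decided
# by the path S_0, ..., S_n alone, with no peeking ahead.

def hitting_time(path, level):
    """First index n with path[n] >= level, or None if never reached."""
    for n, s in enumerate(path):
        if s >= level:
            return n
    return None

path = [0, 1, 0, 1, 2, 3, 2, 3]      # an arbitrary deterministic path
N = hitting_time(path, 3)            # here N = 5

# (N <= n) is computable from the prefix path[:n+1] only:
consistent = all(
    any(s >= 3 for s in path[:n + 1]) == (N is not None and N <= n)
    for n in range(len(path))
)
```

The "stop at the third stop light before the city limits" rule fails exactly this prefix test: deciding it requires values of the path beyond time n.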



Proposition 10.1.
(a) Fixed times n are stopping times.
(b) If N_1 and N_2 are stopping times, then so are N_1 ∧ N_2 and N_1 ∨ N_2.
(c) If N_n is a nondecreasing sequence of stopping times, then so is N = sup_n N_n.
(d) If N_n is a nonincreasing sequence of stopping times, then so is N = inf_n N_n.
(e) If N is a stopping time, then so is N + n.

We define F_N = {A : A ∩ (N ≤ n) ∈ F_n for all n}.


11. Martingales.

In this section we consider martingales. Let F_n be an increasing sequence of σ-fields. A sequence of random variables M_n is adapted to F_n if for each n, M_n is F_n measurable.

M_n is a martingale if M_n is adapted to F_n, M_n is integrable for all n, and

E[M_n | F_{n−1}] = M_{n−1}, a.s., n = 2, 3, . . . . (11.1)

If we have E[M_n | F_{n−1}] ≥ M_{n−1} a.s. for every n, then M_n is a submartingale. If we have E[M_n | F_{n−1}] ≤ M_{n−1}, we have a supermartingale. Submartingales have a tendency to increase.

Let us take a moment to look at some examples. If X_i is a sequence of mean zero i.i.d. random variables and S_n is the partial sum process, then M_n = S_n is a martingale, since E[M_n | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1} | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1}] = M_{n−1}, using independence. If the X_i's have variance one and M_n = S_n² − n, then

E[S_n² | F_{n−1}] = E[(S_n − S_{n−1})² | F_{n−1}] + 2S_{n−1}E[S_n | F_{n−1}] − S²_{n−1} = 1 + S²_{n−1},

using independence. It follows that M_n is a martingale.

Another example is the following: if X ∈ L¹ and M_n = E[X | F_n], then M_n is a martingale.

If M_n is a martingale and H_n ∈ F_{n−1} for each n, it is easy to check that N_n = Σ_{i=1}^n H_i(M_i − M_{i−1}) is also a martingale.

If M_n is a martingale, g is convex, and g(M_n) is integrable for each n, then by Jensen's inequality

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) = g(M_n),

or g(M_n) is a submartingale. Similarly if g is convex and nondecreasing on [0, ∞) and M_n is a positive submartingale, then g(M_n) is a submartingale because

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) ≥ g(M_n).
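For the ±1 coin-flip filtration, the martingale property of these examples can be checked by brute force: E[M_n | F_{n−1}] is the average of M_n over the two equally likely continuations of the path. A small sketch by exhaustive enumeration (the horizon of 4 steps is an arbitrary choice):

```python
from itertools import product

# For +/-1 coin flips, E[M_n | F_{n-1}] is the average of M_n over the
# two equally likely continuations of the path; check it equals M_{n-1}.

def is_martingale_step(M, n, s_prev):
    """M maps (time, S_time) -> process value; s_prev = S_{n-1}."""
    avg = 0.5 * (M(n, s_prev + 1) + M(n, s_prev - 1))
    return abs(avg - M(n - 1, s_prev)) < 1e-12

def check(M, steps=4):
    # enumerate every possible value of S_{n-1} along paths of the walk
    return all(
        is_martingale_step(M, steps, sum(signs))
        for signs in product([1, -1], repeat=steps - 1)
    )

ok_sum = check(lambda n, s: s)              # M_n = S_n
ok_square = check(lambda n, s: s * s - n)   # M_n = S_n^2 - n
```

Both checks succeed because ((s+1) + (s−1))/2 = s and ((s+1)² + (s−1)²)/2 − n = s² − (n−1), exactly as in the computation above.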


12. Optional stopping.

Note that if one takes expectations in (11.1), one has E M_n = E M_{n−1}, and by induction E M_n = E M_0. The theorem about martingales that lies at the basis of all other results is Doob's optional stopping theorem, which says that the same is true if we replace n by a stopping time N. There are various versions, depending on what conditions one puts on the stopping times.

Theorem 12.1. If N is a bounded stopping time with respect to F_n and M_n a martingale, then E M_N = E M_0.

Proof. Since N is bounded, let K be the largest value N takes. We write

E M_N = Σ_{k=0}^K E[M_N; N = k] = Σ_{k=0}^K E[M_k; N = k].



Note (N = k) is F_j measurable if j ≥ k, so

E[M_k; N = k] = E[M_{k+1}; N = k] = E[M_{k+2}; N = k] = . . . = E[M_K; N = k].

Hence

E M_N = Σ_{k=0}^K E[M_K; N = k] = E M_K = E M_0.

This completes the proof.

The assumption that N be bounded cannot be entirely dispensed with. For example, let M_n be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : M_i = 1}, we will see later on that N < ∞ a.s., but E M_N = 1 ≠ 0 = E M_0.
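The contrast can be checked by exact enumeration: the truncated time N ∧ K is bounded, so E M_{N∧K} = 0 for every K, even though E M_N = 1. A sketch (exhaustive over all 2^K paths, using exact rational arithmetic):

```python
from itertools import product
from fractions import Fraction

# Exact E[M_{N ∧ K}] for the +/-1 walk with N = min{i : M_i = 1}:
# enumerate all 2^K equally likely paths of K flips.

def expected_stopped_value(K):
    total = Fraction(0)
    for signs in product([1, -1], repeat=K):
        m, stopped = 0, None
        for x in signs:
            m += x
            if stopped is None and m == 1:
                stopped = m              # N occurred: M_N = 1
        value = stopped if stopped is not None else m   # M_{N ∧ K}
        total += Fraction(value, 2 ** K)
    return total

e_bounded = expected_stopped_value(10)   # N ∧ 10 is bounded, so this is 0
```

As K grows, the paths on which N ≤ K contribute total mass approaching 1 at value +1, but the rare deep-negative paths exactly cancel this; in the limit that cancellation is lost, which is why E M_N = 1.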


The same proof as that in Theorem 12.1 gives the following corollary.

Corollary 12.2. If N is bounded by K and M_n is a submartingale, then E M_N ≤ E M_K.

Also the same proof gives

Corollary 12.3. If N is bounded by K, A ∈ F_N, and M_n is a submartingale, then E[M_N; A] ≤ E[M_K; A].

Proposition 12.4. If N_1 ≤ N_2 are stopping times bounded by K and M is a martingale, then E[M_{N_2} | F_{N_1}] = M_{N_1}, a.s.


Proof. Suppose A ∈ F_{N_1}. We need to show E[M_{N_1}; A] = E[M_{N_2}; A]. Define a new stopping time N_3 by

N_3(ω) = N_1(ω) if ω ∈ A, and N_3(ω) = N_2(ω) if ω ∉ A.

It is easy to check that N_3 is a stopping time, so E M_{N_3} = E M_K = E M_{N_2} implies

E[M_{N_1}; A] + E[M_{N_2}; A^c] = E[M_{N_2}].

Subtracting E[M_{N_2}; A^c] from each side completes the proof.


The following is known as the Doob decomposition.

Proposition 12.5. Suppose X_k is a submartingale with respect to an increasing sequence of σ-fields F_k. Then we can write X_k = M_k + A_k such that M_k is a martingale adapted to the F_k and A_k is a sequence of random variables with A_k being F_{k−1}-measurable and A_0 ≤ A_1 ≤ · · ·.

Proof. Let a_k = E[X_k | F_{k−1}] − X_{k−1} for k = 1, 2, . . . Since X_k is a submartingale, each a_k ≥ 0. Then let A_k = Σ_{i=1}^k a_i. The fact that the A_k are increasing and measurable with respect to F_{k−1} is clear. Set M_k = X_k − A_k. Then

E[M_{k+1} − M_k | F_k] = E[X_{k+1} − X_k | F_k] − a_{k+1} = 0,

or M_k is a martingale.
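As a concrete instance, for the ±1 walk the submartingale X_k = S_k² has a_k = E[S_k² | F_{k−1}] − S²_{k−1} = 1 regardless of the path, so A_k = k and M_k = S_k² − k, recovering the example of Section 11. A sketch computing a_k by averaging the two continuations:

```python
# Doob decomposition of the submartingale X_k = S_k^2 for the +/-1 walk:
# a_k = E[X_k | F_{k-1}] - X_{k-1}, averaged over the two continuations.

def increment(s_prev):
    """a_k as a function of S_{k-1} = s_prev."""
    e_next = 0.5 * ((s_prev + 1) ** 2 + (s_prev - 1) ** 2)
    return e_next - s_prev ** 2       # always equals 1

# a_k = 1 whatever the path, so A_k = k and M_k = S_k^2 - k.
all_one = all(abs(increment(s) - 1.0) < 1e-12 for s in range(-10, 11))
```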



Corollary 12.6. Suppose X_k is a submartingale, and N_1 ≤ N_2 are bounded stopping times. Then

E[X_{N_2} | F_{N_1}] ≥ X_{N_1}.


13. Doob's inequalities.

The first interesting consequences of the optional stopping theorems are Doob's inequalities. If M_n is a martingale, denote M*_n = max_{i≤n} |M_i|.

Theorem 13.1. If M_n is a martingale or a positive submartingale,

P(M*_n ≥ a) ≤ E[|M_n|; M*_n ≥ a]/a ≤ E|M_n|/a.

Proof. Set M_{n+1} = M_n. Let N = min{j : |M_j| ≥ a} ∧ (n + 1). Since | · | is convex, |M_n| is a submartingale. If A = (M*_n ≥ a), then A ∈ F_N and by Corollary 12.3,

aP(M*_n ≥ a) ≤ E[|M_N|; A] ≤ E[|M_n|; A] ≤ E|M_n|.


For p > 1, we have the following inequality.

Theorem 13.2. If p > 1 and E|M_i|^p < ∞ for i ≤ n, then

E(M*_n)^p ≤ (p/(p − 1))^p E|M_n|^p.

Proof. Note M*_n ≤ Σ_{i=1}^n |M_i|, hence M*_n ∈ L^p. We write

E(M*_n)^p = ∫_0^∞ pa^{p−1} P(M*_n > a) da ≤ ∫_0^∞ pa^{p−1} E[|M_n|1_{(M*_n ≥ a)}/a] da
= E ∫_0^{M*_n} pa^{p−2}|M_n| da = (p/(p − 1)) E[(M*_n)^{p−1}|M_n|]
≤ (p/(p − 1)) (E(M*_n)^p)^{(p−1)/p} (E|M_n|^p)^{1/p}.

The last inequality follows by Hölder's inequality. Now divide both sides by the quantity (E(M*_n)^p)^{(p−1)/p}.
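For p = 2 the inequality says E(M*_n)² ≤ 4 E M_n², which can be checked exactly for the ±1 walk by enumerating all paths. A sketch (n = 8 is an arbitrary choice):

```python
from itertools import product
from fractions import Fraction

# Exact check of E[(M_n^*)^2] <= (2/(2-1))^2 E[M_n^2] = 4 E[M_n^2]
# for the +/-1 walk, enumerating all 2^n equally likely paths.

def doob_l2(n):
    e_max_sq = Fraction(0)
    e_end_sq = Fraction(0)
    p = Fraction(1, 2 ** n)
    for signs in product([1, -1], repeat=n):
        m, peak = 0, 0
        for x in signs:
            m += x
            peak = max(peak, abs(m))     # M_n^* along the path
        e_max_sq += p * peak ** 2
        e_end_sq += p * m ** 2
    return e_max_sq, e_end_sq

e_max_sq, e_end_sq = doob_l2(8)           # E[M_8^2] = 8 exactly
```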


14. Martingale convergence theorems.

The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The number of upcrossings of an interval [a, b] is the number of times a process crosses from below a to above b.

To be more exact, let

S_1 = min{k : X_k ≤ a}, T_1 = min{k > S_1 : X_k ≥ b},

and

S_{i+1} = min{k > T_i : X_k ≤ a}, T_{i+1} = min{k > S_{i+1} : X_k ≥ b}.

The number of upcrossings before time n is U_n = max{j : T_j ≤ n}.



Theorem 14.1. (Upcrossing lemma) If X_k is a submartingale,

E U_n ≤ (b − a)^{−1} E[(X_n − a)^+].

Proof. The number of upcrossings of [a, b] by X_k is the same as the number of upcrossings of [0, b − a] by Y_k = (X_k − a)^+. Moreover Y_k is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b − a] by the process Y_k, we will have the desired inequality for upcrossings of X.

So we may assume a = 0. Fix n and define Y_{n+1} = Y_n. This will still be a submartingale. Define the S_i, T_i as above, and let S'_i = S_i ∧ (n + 1), T'_i = T_i ∧ (n + 1). Since T_{i+1} > S_{i+1} > T_i, then T'_{n+1} = n + 1. We write

E Y_{n+1} = E Y_{S'_1} + Σ_{i=1}^{n+1} E[Y_{T'_i} − Y_{S'_i}] + Σ_{i=1}^{n+1} E[Y_{S'_{i+1}} − Y_{T'_i}].

All the summands in the third term on the right are nonnegative since Y_k is a submartingale. For the jth upcrossing, Y_{T'_j} − Y_{S'_j} ≥ b − a, while Y_{T'_j} − Y_{S'_j} is always greater than or equal to 0. So

Σ_{i=1}^{n+1} (Y_{T'_i} − Y_{S'_i}) ≥ (b − a)U_n.

So

E U_n ≤ E Y_{n+1}/(b − a). (4.8)
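Counting upcrossings following the definition of the S_i and T_i is mechanical. A minimal sketch (the test sequence is an arbitrary choice):

```python
# Count upcrossings of [a, b]: each completed passage from a value <= a
# (time S_i) to a later value >= b (time T_i) counts once.

def upcrossings(xs, a, b):
    count, waiting_below = 0, False
    for x in xs:
        if not waiting_below and x <= a:
            waiting_below = True      # some S_i has occurred
        elif waiting_below and x >= b:
            waiting_below = False     # the matching T_i: one upcrossing
            count += 1
    return count

U = upcrossings([0, 2, -1, 3, 0, -2, 1, 4, -1], a=0, b=2)   # U = 3
```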


This leads to the martingale convergence theorem.

Theorem 14.2. If X_n is a submartingale such that sup_n E X_n^+ < ∞, then X_n converges a.s. as n → ∞.

Proof. Let U(a, b) = lim_{n→∞} U_n. For each pair of rationals a < b, by monotone convergence,

E U(a, b) ≤ (b − a)^{−1} sup_n E(X_n − a)^+ < ∞.

So U(a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see that a.s. the sequence X_n(ω) cannot have lim sup X_n > lim inf X_n. Therefore X_n converges a.s., although we still have to rule out the possibility of the limit being infinite. Since X_n is a submartingale, E X_n ≥ E X_0, and thus

E|X_n| = E X_n^+ + E X_n^− = 2E X_n^+ − E X_n ≤ 2E X_n^+ − E X_0.

By Fatou's lemma, E lim_n |X_n| ≤ sup_n E|X_n| < ∞, so X_n converges a.s. to a finite limit.

Corollary 14.3. If X_n is a positive supermartingale or a martingale bounded above or below, X_n converges a.s.

Proof. If X_n is a positive supermartingale, −X_n is a submartingale bounded above by 0. Now apply Theorem 14.2.



If X_n is a martingale bounded above, by considering −X_n, we may assume X_n is bounded below. Looking at X_n + M for fixed M will not affect the convergence, so we may assume X_n is bounded below by 0. Now apply the first assertion of the corollary.

Proposition 14.4. If X_n is a martingale with sup_n E|X_n|^p < ∞ for some p > 1, then the convergence is in L^p as well as a.s. This is also true when X_n is a submartingale. If X_n is a uniformly integrable martingale, then the convergence is in L¹. If X_n → X_∞ in L¹, then X_n = E[X_∞ | F_n].

X_n is a uniformly integrable martingale if the collection of random variables X_n is uniformly integrable.

Proof. The L^p convergence assertion follows by using Doob's inequality (Theorem 13.2) and dominated convergence. The L¹ convergence assertion follows since a.s. convergence together with uniform integrability implies L¹ convergence. Finally, if j < n, we have X_j = E[X_n | F_j]. If A ∈ F_j,

E[X_j; A] = E[X_n; A] → E[X_∞; A]

by the L¹ convergence of X_n to X_∞. Since this is true for all A ∈ F_j, X_j = E[X_∞ | F_j].



15. Applications of martingales.

Let S_n be your fortune at time n. In a fair casino, E[S_{n+1} | F_n] = S_n. If N is a stopping time, the optional stopping theorem says that E S_N = E S_0; in other words, no matter what stopping time you use and what method of betting, you will do no better on average than ending up with what you started with.

An elegant application of martingales is a proof of the SLLN. Fix N large. Let Y_i be i.i.d. Let Z_n = E[Y_1 | S_n, S_{n+1}, . . . , S_N]. We claim Z_n = S_n/n. Certainly S_n/n is σ(S_n, . . . , S_N) measurable. If A ∈ σ(S_n, . . . , S_N) for some n, then A = ((S_n, . . . , S_N) ∈ B) for some Borel subset B of R^{N−n+1}. Since the Y_i are i.i.d., for each k ≤ n,

E[Y_1; (S_n, . . . , S_N) ∈ B] = E[Y_k; (S_n, . . . , S_N) ∈ B].

Summing over k and dividing by n,

E[Y_1; (S_n, . . . , S_N) ∈ B] = E[S_n/n; (S_n, . . . , S_N) ∈ B].

Therefore E[Y_1; A] = E[S_n/n; A] for every A ∈ σ(S_n, . . . , S_N). Thus Z_n = S_n/n.

Let X_k = Z_{N−k}, and let F_k = σ(S_{N−k}, S_{N−k+1}, . . . , S_N). Note F_k gets larger as k gets larger, and by the above X_k = E[Y_1 | F_k]. This shows that X_k is a martingale (cf. the next to last example in Section 11). By Doob's upcrossing inequality, if U_N^X is the number of upcrossings of [a, b] by X, then

E U_N^X ≤ E X_N^+/(b − a) ≤ E|Z_0|/(b − a) ≤ E|Y_1|/(b − a).

This differs by at most one from the number of upcrossings of [a, b] by Z_1, . . . , Z_N. So the expected number of upcrossings of [a, b] by Z_k for k ≤ N is bounded by 1 + E|Y_1|/(b − a). Now let N → ∞. By Fatou's lemma, the expected number of upcrossings of [a, b] by Z_1, . . . is finite. Arguing as in the proof of the martingale convergence theorem, this says that Z_n = S_n/n does not oscillate.

It is conceivable that |S_n/n| → ∞. But by Fatou's lemma,

E[lim |S_n/n|] ≤ lim inf E|S_n/n| ≤ lim inf (nE|Y_1|)/n = E|Y_1| < ∞.
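The conclusion that S_n/n settles down near E Y_1 is easy to illustrate numerically. A Monte Carlo sketch (the distribution, sample size, and seed are arbitrary choices):

```python
import random

# Monte Carlo illustration that S_n / n settles near E[Y_1].
random.seed(0)

n = 200_000
s = 0.0
for _ in range(n):
    s += random.uniform(-1.0, 3.0)   # E[Y_1] = 1, E|Y_1| finite

sample_mean = s / n                  # S_n / n
```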



Proposition 15.1. Suppose the Y_i are i.i.d. with E|Y_1| < ∞, N is a stopping time with E N < ∞, and N is independent of the Y_i. Then E S_N = (E N)(E Y_1), where the S_n are the partial sums of the Y_i.

Proof. S_n − n(E Y_1) is a martingale, so E S_{n∧N} = E(n ∧ N) E Y_1 by optional stopping. The right hand side tends to (E N)(E Y_1) by monotone convergence. S_{n∧N} converges almost surely to S_N, and we need to show the expected values converge. Note

|S_{n∧N}| = Σ_{k=0}^∞ |S_{n∧k}|1_{(N=k)} ≤ Σ_{k=0}^∞ Σ_{j=1}^{n∧k} |Y_j|1_{(N=k)} = Σ_{j=1}^n Σ_{k≥j} |Y_j|1_{(N=k)} = Σ_{j=1}^n |Y_j|1_{(N≥j)} ≤ Σ_{j=1}^∞ |Y_j|1_{(N≥j)}.

The last expression, using the independence, has expected value

Σ_{j=1}^∞ (E|Y_j|) P(N ≥ j) ≤ (E|Y_1|)(1 + E N) < ∞.

So by dominated convergence, we have E S_{n∧N} → E S_N.

Wald's second identity is a similar expression for the variance of S_N.
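Wald's identity lends itself to a direct Monte Carlo check when N is drawn independently of the Y_i. A sketch (the particular distributions and seed are arbitrary choices):

```python
import random

# Monte Carlo check of E[S_N] = (E N)(E Y_1) when N is independent
# of the i.i.d. Y_i.
random.seed(1)

def one_trial():
    n = 1 + random.randrange(10)     # N uniform on {1,...,10}: E N = 5.5
    return sum(random.choice([1, 2, 3]) for _ in range(n))   # E Y_1 = 2

trials = 40_000
estimate = sum(one_trial() for _ in range(trials)) / trials
target = 5.5 * 2.0                   # (E N)(E Y_1) = 11
```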



We can use martingales to find certain hitting probabilities.

Proposition 15.2. Suppose the Y_i are i.i.d. with P(Y_1 = 1) = 1/2, P(Y_1 = −1) = 1/2, and S_n the partial sum process. Suppose a and b are positive integers. Then

P(S_n hits −a before b) = b/(a + b).

If N = min{n : S_n ∈ {−a, b}}, then E N = ab.

Proof. S_n² − n is a martingale, so E S²_{n∧N} = E(n ∧ N). Let n → ∞. The right hand side converges to E N by monotone convergence. Since S_{n∧N} is bounded in absolute value by a + b, the left hand side converges by dominated convergence to E S_N², which is finite. So E N is finite, hence N is finite almost surely.

S_n is a martingale, so E S_{n∧N} = E S_0 = 0. By dominated convergence, and the fact that N < ∞ a.s., hence S_{n∧N} → S_N, we have E S_N = 0, or

−aP(S_N = −a) + bP(S_N = b) = 0.

We also have

P(S_N = −a) + P(S_N = b) = 1.

Solving these two equations for P(S_N = −a) and P(S_N = b) yields our first result. Since E N = E S_N² = a²P(S_N = −a) + b²P(S_N = b), substituting gives the second result.

Based on this proposition, if we let a → ∞, we see that P(N_b < ∞) = 1 and E N_b = ∞, where N_b = min{n : S_n = b}.
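The two answers can be cross-checked by first-step analysis (not used in the text): conditioning on the first step gives p(x) = (p(x−1) + p(x+1))/2 for the hitting probability and m(x) = 1 + (m(x−1) + m(x+1))/2 for the mean exit time. A sketch solving these by value iteration, with a = 2, b = 3:

```python
# First-step analysis for the +/-1 walk exiting (-a, b):
#   p(x) = P(hit -a before b | S_0 = x):  p(-a) = 1, p(b) = 0
#   m(x) = E[exit time | S_0 = x]:        m(-a) = m(b) = 0
a, b = 2, 3
p = {x: 0.0 for x in range(-a, b + 1)}
m = {x: 0.0 for x in range(-a, b + 1)}
p[-a] = 1.0                          # boundary conditions

for _ in range(20_000):              # value iteration to convergence
    for x in range(-a + 1, b):
        p[x] = 0.5 * (p[x - 1] + p[x + 1])
        m[x] = 1.0 + 0.5 * (m[x - 1] + m[x + 1])

hit_prob, mean_exit = p[0], m[0]     # should be b/(a+b) and a*b
```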



Proposition 15.3. Suppose A_n ∈ F_n. Then (A_n i.o.) and (Σ_{n=1}^∞ P(A_n | F_{n−1}) = ∞) differ by a null set.

Proof. Let X_n = Σ_{m=1}^n [1_{A_m} − P(A_m | F_{m−1})]. Note |X_n − X_{n−1}| ≤ 1. Also, it is easy to see that E[X_n − X_{n−1} | F_{n−1}] = 0, so X_n is a martingale.

We claim that for almost every ω either lim X_n exists and is finite, or else lim sup X_n = ∞ and lim inf X_n = −∞. In fact, if N = min{n : X_n ≤ −k}, then X_{n∧N} ≥ −k − 1, so X_{n∧N} converges by the martingale convergence theorem. Therefore lim X_n exists and is finite on (N = ∞). So if lim X_n does not exist or is not finite, then N < ∞. This is true for all k, hence lim inf X_n = −∞. A similar argument shows lim sup X_n = ∞ in this case.

Now if lim X_n exists and is finite, then Σ_{n=1}^∞ 1_{A_n} = ∞ if and only if Σ P(A_n | F_{n−1}) = ∞. On the other hand, if the limit does not exist or is not finite, then Σ 1_{A_n} = ∞ and Σ P(A_n | F_{n−1}) = ∞.


16. Weak convergence.

We will see later that if the X_i are i.i.d. with mean zero and variance one, then S_n/√n converges in the sense

P(S_n/√n ∈ [a, b]) → P(Z ∈ [a, b]),

where Z is a standard normal. If S_n/√n converged in probability or almost surely, then by the zero-one law it would converge to a constant, contradicting the above. We want to generalize the above type of convergence.

We say F_n converges weakly to F if F_n(x) → F(x) for all x at which F is continuous. Here F_n and F are distribution functions. We say X_n converges weakly to X if F_{X_n} converges weakly to F_X. We sometimes say X_n converges in distribution or converges in law to X. Probabilities μ_n converge weakly if their corresponding distribution functions converge, that is, if F_{μ_n}(x) = μ_n(−∞, x] converges weakly.

An example that illustrates why we restrict the convergence to continuity points of F is the following. Let X_n = 1/n with probability one, and X = 0 with probability one. F_{X_n}(x) is 0 if x < 1/n and 1 otherwise. F_{X_n}(x) converges to F_X(x) for all x except x = 0.


Proposition 16.1. X_n converges weakly to X if and only if E g(X_n) → E g(X) for all g bounded and continuous.

The idea that E g(X_n) converges to E g(X) for all g bounded and continuous makes sense for any metric space and is used as a definition of weak convergence for X_n taking values in general metric spaces.

Proof. First suppose E g(X_n) converges to E g(X). Let x be a continuity point of F, let ε > 0, and choose δ such that |F(y) − F(x)| < ε if |y − x| < δ. Choose g continuous such that g is one on (−∞, x], takes values between 0 and 1, and is 0 on [x + δ, ∞). Then F_{X_n}(x) ≤ E g(X_n) → E g(X) ≤ F_X(x + δ) ≤ F(x) + ε.

Similarly, if h is a continuous function taking values between 0 and 1 that is 1 on (−∞, x − δ] and 0 on [x, ∞), F_{X_n}(x) ≥ E h(X_n) → E h(X) ≥ F_X(x − δ) ≥ F(x) − ε. Since ε is arbitrary, F_{X_n}(x) → F_X(x).

Now suppose X_n converges weakly to X. If a and b are continuity points of F and of all the F_{X_n}, then E 1_{[a,b]}(X_n) = F_{X_n}(b) − F_{X_n}(a) → F(b) − F(a) = E 1_{[a,b]}(X). By taking linear combinations, we have E g(X_n) → E g(X) for every g which is a step function where the end points of the intervals are continuity points for all the F_{X_n} and for F_X. Since the set of points that are not a continuity point for some F_{X_n} or for F_X is countable, and we can approximate any continuous function on an interval by such step functions uniformly, we have E g(X_n) → E g(X) for all g such that the support of g is a closed interval whose endpoints are continuity points of F_X and g is continuous on its support.

Let ε > 0 and choose M such that F_X(M) > 1 − ε and F_X(−M) < ε and so that M and −M are continuity points of F_X. By the preceding, E(1_{[−M,M]}g)(X_n) → E(1_{[−M,M]}g)(X),



where g is a bounded continuous function. The difference between E(1_{[−M,M]}g)(X) and E g(X) is bounded by ‖g‖_∞ P(X ∉ [−M, M]) ≤ 2ε‖g‖_∞. Similarly, when X is replaced by X_n, the difference is bounded by ‖g‖_∞ P(X_n ∉ [−M, M]) → ‖g‖_∞ P(X ∉ [−M, M]). So for n large, it is less than 3ε‖g‖_∞. Since ε is arbitrary, E g(X_n) → E g(X) whenever g is bounded and continuous.


Let us examine the relationship between weak convergence and convergence in probability. The example of S_n/√n shows that one can have weak convergence without convergence in probability.

Proposition 16.2. (a) If X_n converges to X in probability, then it converges weakly.
(b) If X_n converges weakly to a constant, it converges in probability.
(c) (Slutsky's theorem) If X_n converges weakly to X and Y_n converges weakly to a constant c, then X_n + Y_n converges weakly to X + c and X_nY_n converges weakly to cX.

Proof. To prove (a), let g be a bounded and continuous function. If n_j is any subsequence, then there exists a further subsequence such that X(n_{j_k}) converges almost surely to X. Then by dominated convergence, E g(X(n_{j_k})) → E g(X). That suffices to show E g(X_n) converges to E g(X).

For (b), if X_n converges weakly to c,

P(X_n − c > ε) = P(X_n > c + ε) = 1 − P(X_n ≤ c + ε) → 1 − P(c ≤ c + ε) = 0.

We use the fact that if Y ≡ c, then c + ε is a point of continuity for F_Y. A similar equation shows P(X_n − c ≤ −ε) → 0, so P(|X_n − c| > ε) → 0.

We now prove the first part of (c), leaving the second part for the reader. Let x be a point such that x − c is a continuity point of F_X. Choose ε so that x − c + ε is again a continuity point. Then

P(X_n + Y_n ≤ x) ≤ P(X_n + c ≤ x + ε) + P(|Y_n − c| > ε) → P(X ≤ x − c + ε).

So lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as we like and x − c is a continuity point of F_X, then lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x). The lim inf is done similarly.

We say a sequence of distribution functions {F_n} is tight if for each ε > 0 there exists M such that F_n(M) ≥ 1 − ε and F_n(−M) ≤ ε for all n. A sequence of r.v.s is tight if the corresponding distribution functions are tight; this is equivalent to P(|X_n| ≥ M) ≤ ε for all n.

Theorem 16.3. (Helly's theorem) Let F_n be a sequence of distribution functions that is tight. There exists a subsequence n_j and a distribution function F such that F_{n_j} converges weakly to F.

What could happen without tightness is that X_n = n, so that F_{X_n} → 0; the tightness precludes this.


Proof. Let q_k be an enumeration of the rationals. Since F_n(q_k) ∈ [0, 1], any subsequence has a further subsequence that converges. Use diagonalization so that F_{n_j}(q_k) converges for each q_k and call the limit F(q_k). F is nondecreasing, and define F(x) = inf_{q_k > x} F(q_k). So F is right continuous and nondecreasing.

If x is a point of continuity of F and ε > 0, then there exist r and s rational such that r < x < s and F(s) − ε < F(x) < F(r) + ε. Then

F_{n_j}(x) ≥ F_{n_j}(r) → F(r) > F(x) − ε

and

F_{n_j}(x) ≤ F_{n_j}(s) → F(s) < F(x) + ε.

Since ε is arbitrary, F_{n_j}(x) → F(x).

Since the F_n are tight, there exists M such that F_n(−M) < ε for all n. Then F(−M) ≤ ε, which implies F(x) → 0 as x → −∞; similarly F(x) → 1 as x → ∞, so F is a distribution function.



Proposition 16.4. Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing and ϕ(x) → ∞ as x → ∞. If c = sup_n E ϕ(|X_n|) < ∞, then the X_n are tight.

Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M. Then

P(|X_n| > M) ≤ ∫ (ϕ(|X_n|)/(c/ε)) 1_{(|X_n|>M)} dP ≤ (ε/c) E ϕ(|X_n|) ≤ ε.


17. Characteristic functions.

We define the characteristic function of a random variable X by ϕ_X(t) = E e^{itX} for t ∈ R.

Note that ϕ_X(t) = ∫ e^{itx} P_X(dx). So if X and Y have the same law, they have the same characteristic function. Also, if the law of X has a density, that is, P_X(dx) = f_X(x) dx, then ϕ_X(t) = ∫ e^{itx} f_X(x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of f_X.

Proposition 17.1. ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) = \overline{ϕ(t)}, and ϕ is uniformly continuous.

Proof. Since |e^{itx}| ≤ 1, everything follows immediately from the definitions except the uniform continuity. For that we write

|ϕ(t + h) − ϕ(t)| = |E e^{i(t+h)X} − E e^{itX}| ≤ E|e^{itX}(e^{ihX} − 1)| = E|e^{ihX} − 1|.

|e^{ihX} − 1| tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence. Note that the right hand side is independent of t.

Proposition 17.2. ϕ_{aX}(t) = ϕ_X(at) and ϕ_{X+b}(t) = e^{itb}ϕ_X(t).

Proof. The first follows from E e^{it(aX)} = E e^{i(at)X}, and the second is similar.

Proposition 17.3. If X and Y are independent, then ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t).

Proof. From the multiplication theorem,

E e^{it(X+Y)} = E e^{itX}e^{itY} = E e^{itX} E e^{itY}.

Note that if X_1 and X_2 are independent and identically distributed, then

ϕ_{X_1−X_2}(t) = ϕ_{X_1}(t)ϕ_{−X_2}(t) = ϕ_{X_1}(t)ϕ_{X_2}(−t) = ϕ_{X_1}(t)\overline{ϕ_{X_2}(t)} = |ϕ_{X_1}(t)|².


Let us look at some examples of characteristic functions.

(a) Bernoulli: By direct computation, this is pe^{it} + (1 − p) = 1 − p(1 − e^{it}).

(b) Coin flip: (i.e., P(X = +1) = P(X = −1) = 1/2) We have (1/2)e^{it} + (1/2)e^{−it} = cos t.

(c) Poisson:

E e^{itX} = Σ_{k=0}^∞ e^{itk} e^{−λ} λ^k/k! = e^{−λ} Σ (λe^{it})^k/k! = e^{−λ}e^{λe^{it}} = e^{λ(e^{it}−1)}.

(e) Binomial: Write X as the sum of n independent Bernoulli r.v.s B_i. So

ϕ_X(t) = Π_{i=1}^n ϕ_{B_i}(t) = [ϕ_{B_i}(t)]^n = [1 − p(1 − e^{it})]^n.

(f) Geometric:

ϕ(t) = Σ_{k=0}^∞ p(1 − p)^k e^{itk} = p Σ ((1 − p)e^{it})^k = p/(1 − (1 − p)e^{it}).

(g) Uniform on [a, b]:

ϕ(t) = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{itb} − e^{ita})/((b − a)it).

Note that when a = −b this reduces to sin(bt)/bt.

(h) Exponential:

∫_0^∞ λe^{itx}e^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ/(λ − it).

(i) Standard normal:

ϕ(t) = (1/√(2π)) ∫_{−∞}^∞ e^{itx}e^{−x²/2} dx.

This can be done by completing the square and then doing a contour integration. Alternately, ϕ'(t) = (1/√(2π)) ∫_{−∞}^∞ ixe^{itx}e^{−x²/2} dx (do the real and imaginary parts separately, and use the dominated convergence theorem to justify taking the derivative inside). Integrating by parts (again doing the real and imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ'(t) = −tϕ(t) with ϕ(0) = 1 is ϕ(t) = e^{−t²/2}.

(j) Normal with mean μ and variance σ²: Writing X = σZ + μ, where Z is a standard normal, then

ϕ_X(t) = e^{iμt}ϕ_Z(σt) = e^{iμt−σ²t²/2}.

(k) Cauchy: We have

ϕ(t) = (1/π) ∫ e^{itx}/(1 + x²) dx.

This is a standard exercise in contour integration in complex analysis. The answer is e^{−|t|}.
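The Poisson computation can be sanity-checked numerically by summing the defining series. A sketch (λ = 2 and t = 1.3 are arbitrary choices):

```python
import cmath
import math

# Sanity-check the Poisson characteristic function
#   E e^{itX} = exp(lam (e^{it} - 1))
# by summing the defining series numerically.
lam, t = 2.0, 1.3

series = sum(
    cmath.exp(1j * t * k) * math.exp(-lam) * lam ** k / math.factorial(k)
    for k in range(80)               # tail beyond k = 80 is negligible
)
closed_form = cmath.exp(lam * (cmath.exp(1j * t) - 1))
err = abs(series - closed_form)
```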


18. Inversion formula.

We need a preliminary real variable lemma, and then we can proceed to the inversion formula, which gives a formula for the distribution function in terms of the characteristic function.

Lemma 18.1. (a) ∫_0^N (sin(Ax)/x) dx → sgn(A) π/2 as N → ∞.
(b) sup_a |∫_0^a (sin(Ax)/x) dx| < ∞.



An alternate proof of (a) is the following. e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞}. So

∫_0^a (sin x)/x dx = ∫_0^a ∫_0^∞ e^{−xy} sin x dy dx
= ∫_0^∞ ∫_0^a e^{−xy} sin x dx dy
= ∫_0^∞ [ (e^{−xy}/(y² + 1)) (−y sin x − cos x) ]_{x=0}^{x=a} dy
= ∫_0^∞ [ (e^{−ay}/(y² + 1)) (−y sin a − cos a) − (−1/(y² + 1)) ] dy
= π/2 − sin a ∫_0^∞ (ye^{−ay}/(y² + 1)) dy − cos a ∫_0^∞ (e^{−ay}/(y² + 1)) dy.

The last two integrals tend to 0 as a → ∞ since the integrands are bounded by (1 + y)e^{−y} if a ≥ 1.
Theorem 18.2. (Inversion formula) Let μ be a probability measure and let ϕ(t) = ∫ e^{itx} μ(dx). If a < b, then

lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = μ(a, b) + (1/2)μ({a}) + (1/2)μ({b}).

The example where μ is point mass at 0, so ϕ(t) = 1, shows that one needs to take a limit, since the integrand in this case is 2 sin t/t, which is not integrable.


Proof. By Fubini,

∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = ∫_{−T}^T ∫ ((e^{−ita} − e^{−itb})/(it)) e^{itx} μ(dx) dt
= ∫ ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) e^{itx} dt μ(dx).

To justify this, we bound the integrand using the mean value theorem: |(e^{−ita} − e^{−itb})/(it)| ≤ b − a, so the double integral converges absolutely.

Expanding e^{−itb} and e^{−ita} using Euler's formula, and using the fact that cos is an even function and sin is odd, we are left with

∫ 2 [ ∫_0^T (sin(t(x − a))/t) dt − ∫_0^T (sin(t(x − b))/t) dt ] μ(dx).

Using Lemma 18.1 and dominated convergence, this tends to

∫ [π sgn(x − a) − π sgn(x − b)] μ(dx).

Dividing by 2π and noting that (sgn(x − a) − sgn(x − b))/2 equals 1 on (a, b), 1/2 at a and at b, and 0 outside [a, b] completes the proof.


Theorem 18.3. If ∫ |ϕ(t)| dt < ∞, then μ has a bounded density f and

f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt.

Proof.

μ(a, b) + (1/2)μ({a}) + (1/2)μ({b}) = lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
= (1/2π) ∫_{−∞}^∞ ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
≤ ((b − a)/2π) ∫ |ϕ(t)| dt.

Letting b → a shows that μ has no point masses. We now write

μ(x, x + h) = (1/2π) ∫ ((e^{−itx} − e^{−it(x+h)})/(it)) ϕ(t) dt
= (1/2π) ∫ (∫_x^{x+h} e^{−ity} dy) ϕ(t) dt
= ∫_x^{x+h} [ (1/2π) ∫ e^{−ity} ϕ(t) dt ] dy.

So μ has density (1/2π) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 17.1, we see f is continuous.
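Applied to the standard normal, where ϕ(t) = e^{−t²/2} is integrable, the formula recovers the N(0,1) density. A numerical sketch using a plain midpoint Riemann sum (the truncation T and grid size are arbitrary choices):

```python
import math

# Invert phi(t) = exp(-t^2/2) numerically via
#   f(y) = (1/(2 pi)) \int e^{-ity} phi(t) dt,
# which should recover the N(0,1) density. The imaginary part
# integrates to zero by symmetry, leaving a cosine integral.
y = 0.7
T, n = 30.0, 150_000                 # truncation and grid (arbitrary)
dt = 2 * T / n

f_y = sum(
    math.exp(-t * t / 2) * math.cos(t * y)
    for t in (-T + (k + 0.5) * dt for k in range(n))
) * dt / (2 * math.pi)

exact = math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
```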


A corollary to the inversion formula is the uniqueness theorem.

Theorem 18.4. If ϕ_X = ϕ_Y, then P_X = P_Y.

The following proposition can be proved directly, but the proof using characteristic functions is much easier.

Proposition 18.5. (a) If X and Y are independent, X is a normal with mean a and variance b², and Y is a normal with mean c and variance d², then X + Y is normal with mean a + c and variance b² + d².
(b) If X and Y are independent, X is Poisson with parameter λ_1, and Y is Poisson with parameter λ_2, then X + Y is Poisson with parameter λ_1 + λ_2.
(c) If X_i are i.i.d. Cauchy, then S_n/n is Cauchy.

Proof. For (a),

ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t) = e^{iat−b²t²/2} e^{ict−d²t²/2} = e^{i(a+c)t−(b²+d²)t²/2}.

Now use the uniqueness theorem.

Parts (b) and (c) are proved similarly.


19. Continuity theorem.

Lemma 19.1. Suppose ϕ is the characteristic function of a probability μ. Then

μ([−2A, 2A]) ≥ A |∫_{−1/A}^{1/A} ϕ(t) dt| − 1.


Proof. Note

(1/2T) ∫_{−T}^T ϕ(t) dt = (1/2T) ∫_{−T}^T ∫ e^{itx} μ(dx) dt = ∫∫ (1/2T) 1_{[−T,T]}(t) e^{itx} dt μ(dx) = ∫ (sin Tx)/(Tx) μ(dx).

Since |(sin Tx)/(Tx)| ≤ 1 and |sin Tx| ≤ 1, then |(sin Tx)/(Tx)| ≤ 1/(2TA) if |x| ≥ 2A. So

|∫ (sin Tx)/(Tx) μ(dx)| ≤ μ([−2A, 2A]) + ∫_{[−2A,2A]^c} (1/(2TA)) μ(dx)
= μ([−2A, 2A]) + (1/(2TA))(1 − μ([−2A, 2A]))
= 1/(2TA) + (1 − 1/(2TA)) μ([−2A, 2A]).

Setting T = 1/A,

|(A/2) ∫_{−1/A}^{1/A} ϕ(t) dt| ≤ 1/2 + (1/2) μ([−2A, 2A]).

Now multiply both sides by 2.


Proposition 19.2. If μ_n converges weakly to μ, then ϕ_n converges to ϕ uniformly on every finite interval.

Proof. Let ε > 0 and choose M large so that μ([−M, M]^c) < ε. Define f to be 1 on [−M, M], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dμ_n → ∫ f dμ, then if n is large enough,

∫ (1 − f) dμ_n ≤ 2ε.

We have

|ϕ_n(t + h) − ϕ_n(t)| ≤ ∫ |e^{ihx} − 1| μ_n(dx) ≤ 2 ∫ (1 − f) dμ_n + |h| ∫ |x| f(x) μ_n(dx) ≤ 4ε + |h|(M + 1).

So for n large enough and |h| ≤ ε/(M + 1), we have

|ϕ_n(t + h) − ϕ_n(t)| ≤ 5ε,

which says that the ϕ_n are equicontinuous. Since ϕ_n(t) → ϕ(t) for each t (apply Proposition 16.1 to the real and imaginary parts of e^{itx}, which are bounded and continuous), equicontinuity makes the convergence uniform on finite intervals.


The interesting result of this section is the converse, Lévy's continuity theorem.

Theorem 19.3. Suppose μ_n are probabilities, ϕ_n(t) converges to a function ϕ(t) for each t, and ϕ is continuous at 0. Then ϕ is the characteristic function of a probability μ and μ_n converges weakly to μ.

Proof. Let ε > 0. Since ϕ is continuous at 0 (and ϕ(0) = lim ϕ_n(0) = 1), choose δ small so that

|(1/2δ) ∫_{−δ}^δ ϕ(t) dt − 1| < ε.

Using the dominated convergence theorem, choose N such that

(1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt < ε

if n ≥ N. So if n ≥ N,

|(1/2δ) ∫_{−δ}^δ ϕ_n(t) dt| ≥ |(1/2δ) ∫_{−δ}^δ ϕ(t) dt| − (1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt ≥ 1 − 2ε.

By Lemma 19.1 with A = 1/δ, for such n,

μ_n[−2/δ, 2/δ] ≥ 2(1 − 2ε) − 1 = 1 − 4ε.

This shows the μ_n are tight.

Let n_j be a subsequence such that μ_{n_j} converges weakly, say to μ. Then ϕ_{n_j}(t) → ϕ_μ(t), hence ϕ(t) = ϕ_μ(t), or ϕ is the characteristic function of a probability μ. If μ′ is any subsequential weak limit point of μ_n, then ϕ_{μ′}(t) = ϕ(t) = ϕ_μ(t); so μ′ must equal μ. Hence μ_n converges weakly to μ.



Proposition 19.4. If E|X|^k < ∞ for an integer k, then ϕ_X has a continuous derivative of order k and

ϕ^{(k)}(t) = ∫ (ix)^k e^{itx} P_X(dx).

In particular, ϕ^{(k)}(0) = i^k E X^k.

Proof. Write

(ϕ(t + h) − ϕ(t))/h = ∫ ((e^{i(t+h)x} − e^{itx})/h) P_X(dx).

The modulus of the integrand is bounded by |x|. So if ∫ |x| P_X(dx) < ∞, we can use dominated convergence to obtain the desired formula for ϕ′(t). As in the proof of Proposition 17.1, we see ϕ′(t) is continuous. We do the case of general k by induction. Evaluating ϕ^{(k)} at 0 gives the particular case.


Here is a converse.

Proposition 19.5. If ϕ is the characteristic function of a random variable X and ϕ′′(0) exists, then E|X|² < ∞.

Proof. Note

(e^{ihx} − 2 + e^{−ihx})/h² = −2(1 − cos hx)/h² ≤ 0

and 2(1 − cos hx)/h² converges to x² as h → 0. So by Fatou's lemma,

∫ x² P_X(dx) ≤ 2 lim inf_{h→0} ∫ ((1 − cos hx)/h²) P_X(dx) = −lim sup_{h→0} (ϕ(h) − 2ϕ(0) + ϕ(−h))/h² = −ϕ′′(0) < ∞.


One nice application of the continuity theorem is a proof of the weak law of large numbers. Its proof is very similar to the proof of the central limit theorem, which we give in the next section.

Another nice use of characteristic functions and martingales is the following.

Proposition 19.6. Suppose X_i is a sequence of independent r.v.s and S_n converges weakly. Then S_n converges almost surely.


Proof. Suppose Sn converges weakly to W. Then ϕSn(t) → ϕW(t) uniformly on compact sets by Proposition
19.2. Since ϕW(0) = 1 and ϕW is continuous, there exists δ such that |ϕW(t) − 1| < 1/2 if |t| < δ. So for n
large, |ϕSn(t)| ≥ 1/4 if |t| < δ.



Note

E[e^{itSn} | X1, . . . , Xn−1] = e^{itS_{n−1}} E[e^{itXn} | X1, . . . , Xn−1] = e^{itS_{n−1}} ϕXn(t).

Since ϕSn(t) = Π ϕXi(t), it follows that e^{itSn}/ϕSn(t) is a martingale.

Therefore for |t| < δ and n large, e^{itSn}/ϕSn(t) is a bounded martingale, and hence converges almost
surely. Since ϕSn(t) → ϕW(t) ≠ 0, e^{itSn} converges almost surely if |t| < δ.


Let A = {(ω, t) ∈ Ω × (−δ, δ) : e^{itSn(ω)} does not converge}. For each t, we have almost sure
convergence, so ∫ 1A(ω, t) P(dω) = 0. Therefore ∫_{−δ}^{δ} ∫ 1A dP dt = 0, and by Fubini, ∫ ∫_{−δ}^{δ} 1A dt dP = 0. Hence,
almost surely, ∫ 1A(ω, t) dt = 0. This means there exists a set N with P(N) = 0 such that if ω ∉ N, then
e^{itSn(ω)} converges for almost every t ∈ (−δ, δ).



If ω ∉ N, by dominated convergence, ∫_0^a e^{itSn(ω)} dt converges as n → ∞, provided a < δ. Call the limit Aa. Also

∫_0^a e^{itSn(ω)} dt = (e^{iaSn(ω)} − 1)/(iSn(ω))

if Sn(ω) ≠ 0, and equals a otherwise.

Since Sn converges weakly, it is not possible for |Sn| → ∞ with positive probability. If we let
N′ = {ω : |Sn(ω)| → ∞} and choose ω ∉ N ∪ N′, there exists a subsequence Snj(ω) which converges
to a finite limit, say R. We can choose a < δ such that e^{iaSn(ω)} converges and e^{iaR} ≠ 1. Therefore
Aa = (e^{iaR} − 1)/(iR), a nonzero quantity. But then

Sn(ω) = (e^{iaSn(ω)} − 1) / ( i ∫_0^a e^{itSn(ω)} dt ) → ( lim_{n→∞} e^{iaSn(ω)} − 1 )/(iAa).

Therefore, except for ω ∈ N ∪ N′, we have that Sn(ω) converges.


20. Central limit theorem.


The simplest case of the central limit theorem (CLT) is the case when the Xi are i.i.d., with mean
zero and variance one, and then the CLT says that Sn/√n converges weakly to a standard normal. We first
prove this case.


We need the fact that if cn are complex numbers converging to c, then (1 + (cn/n))^n → e^c. We leave
the proof of this to the reader, with the warning that any proof using logarithms needs to be done with some
care, since log z is a multi-valued function when z is complex.


Theorem 20.1. Suppose the Xi are i.i.d., mean zero, and variance one. Then Sn/√n converges weakly to
a standard normal.


Proof. Since X1 has finite second moment, ϕX1 has a continuous second derivative. By Taylor’s
theorem,

ϕX1(t) = ϕX1(0) + ϕ′X1(0) t + ϕ″X1(0) t^2/2 + R(t),

where |R(t)|/t^2 → 0 as |t| → 0. So

ϕX1(t) = 1 − t^2/2 + R(t).

Then

ϕ_{Sn/√n}(t) = ϕSn(t/√n) = (ϕX1(t/√n))^n = [1 − t^2/(2n) + R(t/√n)]^n.

Since t/√n converges to zero as n → ∞, we have nR(t/√n) = t^2 [R(t/√n)/(t/√n)^2] → 0, and so by the
fact above,

ϕ_{Sn/√n}(t) → e^{−t^2/2}.

Now apply the continuity theorem.
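The theorem can also be seen by simulation (an added illustration, not part of the notes); here the Xi are taken uniform on (−√3, √3), an arbitrary mean-zero, variance-one choice, and the empirical distribution of Sn/√n is compared with the standard normal CDF.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)
n, trials = 200, 2000
b = sqrt(3.0)  # uniform(-sqrt(3), sqrt(3)) has mean 0 and variance 1
vals = [sum(random.uniform(-b, b) for _ in range(n)) / sqrt(n)
        for _ in range(trials)]

# Compare the empirical CDF of S_n/sqrt(n) with the standard normal CDF.
Z = NormalDist()
for a in (-1.0, 0.0, 1.0):
    emp = sum(v <= a for v in vals) / trials
    print(a, round(emp, 3), round(Z.cdf(a), 3))
```

The empirical and normal CDF values agree to within Monte Carlo error (about 0.01 here).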



Proposition 20.2. With Xi as above, Sn/√n converges weakly to a standard normal.


Proof. Let Y1, . . . , Yn be i.i.d. standard normal r.v.s that are independent of the Xi. Let Z1 = Y2 + · · · + Yn,
Z2 = X1 + Y3 + · · · + Yn, Z3 = X1 + X2 + Y4 + · · · + Yn, etc.

Let us suppose g ∈ C^3 with compact support and let W be a standard normal. Our first goal is to
show

|Eg(Sn/√n) − Eg(W)| → 0. (20.1)

We have

Eg(Sn/√n) − Eg(W) = Eg(Sn/√n) − Eg(Σ_{i=1}^n Yi/√n)
= Σ_{i=1}^n [ Eg((Xi + Zi)/√n) − Eg((Yi + Zi)/√n) ].

By Taylor’s theorem,

g((Xi + Zi)/√n) = g(Zi/√n) + g′(Zi/√n) Xi/√n + (1/2) g″(Zi/√n) Xi^2/n + Rn,

where |Rn| ≤ ‖g‴‖∞ |Xi|^3/n^{3/2}. Taking expectations and using the independence,

Eg((Xi + Zi)/√n) = Eg(Zi/√n) + 0 + (1/(2n)) Eg″(Zi/√n) E[Xi^2] + ERn.

We have a very similar expression for Eg((Yi + Zi)/√n). Since E[Xi^2] = E[Yi^2] = 1, taking the difference,

| Eg((Xi + Zi)/√n) − Eg((Yi + Zi)/√n) | ≤ ‖g‴‖∞ (E|Xi|^3 + E|Yi|^3)/n^{3/2}.

Summing over i from 1 to n, we have (20.1).

By approximating continuous functions with compact support by C^3 functions with compact support,
we have (20.1) for such g. Since E(Sn/√n)^2 = 1, the sequence Sn/√n is tight. So given ε there exists M
such that P(|Sn/√n| > M) < ε for all n. By taking M larger if necessary, we also have P(|W| > M) < ε.
Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded
by one, is nonnegative, and that equals 1 on [−M, M]. By (20.1) applied to gψ,

|E(gψ)(Sn/√n) − E(gψ)(W)| → 0.

However,

|Eg(Sn/√n) − E(gψ)(Sn/√n)| ≤ ‖g‖∞ P(|Sn/√n| > M) < ε‖g‖∞,

and similarly

|Eg(W) − E(gψ)(W)| < ε‖g‖∞.

Since ε is arbitrary, this proves (20.1) for bounded continuous g. By Proposition 16.1, this proves our
proposition.



Proposition 20.3. Suppose for each n the r.v.s Xni, i = 1, . . . , n, are i.i.d. Bernoullis with parameter pn.
If npn → λ and Sn = Σ_{i=1}^n Xni, then Sn converges weakly to a Poisson r.v. with parameter λ.


Proof. We write

ϕSn(t) = [ϕXn1(t)]^n = [1 + pn(e^{it} − 1)]^n = [1 + (npn/n)(e^{it} − 1)]^n → e^{λ(e^{it}−1)}.

Now apply the continuity theorem.
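This convergence can be seen directly in the probability mass functions (an added numerical check, with λ = 2 and k = 3 chosen arbitrarily): P(Sn = k) with Bernoulli parameter pn = λ/n approaches the Poisson mass e^{−λ}λ^k/k!.

```python
from math import comb, exp, factorial

lam, k = 2.0, 3

def binom_pmf(n):
    # P(S_n = k) when the X_{ni} are Bernoulli(lambda/n)
    p = lam / n
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

poisson_pmf = exp(-lam) * lam ** k / factorial(k)
for n in (10, 100, 10000):
    print(n, round(binom_pmf(n), 6), round(poisson_pmf, 6))
```

The binomial mass approaches the Poisson mass as n grows, with error of order λ^2/n.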


A much more general theorem than Theorem 20.1 is the Lindeberg-Feller theorem.

Theorem 20.4. Suppose for each n, Xni, i = 1, . . . , n, are mean zero independent random variables. Suppose
(a) Σ_{i=1}^n E[Xni^2] → σ^2 > 0 and
(b) for each ε, Σ_{i=1}^n E[|Xni|^2; |Xni| > ε] → 0.
Let Sn = Σ_{i=1}^n Xni. Then Sn converges weakly to a normal r.v. with mean zero and variance σ^2.

Note nothing is said about independence of the Xni for different n.


Let us look at Theorem 20.1 in light of this theorem. Suppose the Yi are i.i.d. and let Xni = Yi/√n.
Then

Σ_{i=1}^n E(Yi/√n)^2 = E[Y1^2]

and

Σ_{i=1}^n E[|Xni|^2; |Xni| > ε] = n E[|Y1|^2/n; |Y1| > √n ε] = E[|Y1|^2; |Y1| > √n ε],

which tends to 0 by the dominated convergence theorem.

If the Yi are independent with mean 0, and

Σ_{i=1}^n E|Yi|^3 / (Var Sn)^{3/2} → 0,

then Sn/(Var Sn)^{1/2} converges weakly to a standard normal. This is known as Lyapounov’s theorem; we
leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.


Proof. Let ϕni be the characteristic function of Xni and let σni^2 be the variance of Xni. We need to show

Π_{i=1}^n ϕni(t) → e^{−t^2 σ^2/2}. (20.2)

Using Taylor series, |e^{ib} − 1 − ib + b^2/2| ≤ c|b|^3 for a constant c. Also,

|e^{ib} − 1 − ib + b^2/2| ≤ |e^{ib} − 1 − ib| + |b^2|/2 ≤ c|b|^2.

If we apply these bounds to the random variable tY and take expectations,

|ϕY(t) − (1 + itEY − t^2 EY^2/2)| ≤ cE[(t^2 Y^2) ∧ (|t|^3 |Y|^3)].

Applying this to Y = Xni, which has mean zero,

|ϕni(t) − (1 − t^2 σni^2/2)| ≤ cE[(t^2 |Xni|^2) ∧ (|t|^3 |Xni|^3)].

The right hand side is less than or equal to

cE[|t|^3 |Xni|^3; |Xni| ≤ ε] + cE[t^2 |Xni|^2; |Xni| > ε]
≤ cε|t|^3 E[|Xni|^2] + ct^2 E[|Xni|^2; |Xni| ≥ ε].

Summing over i we obtain

Σ_{i=1}^n |ϕni(t) − (1 − t^2 σni^2/2)| ≤ cε|t|^3 Σ E[|Xni|^2] + ct^2 Σ E[|Xni|^2; |Xni| ≥ ε].


We need the following inequality: if |ai|, |bi| ≤ 1, then

| Π_{i=1}^n ai − Π_{i=1}^n bi | ≤ Σ_{i=1}^n |ai − bi|.

To prove this, note

Π ai − Π bi = (an − bn) Π_{i<n} bi + an ( Π_{i<n} ai − Π_{i<n} bi )

and use induction.


Note |ϕni(t)| ≤ 1 and |1 − t^2 σni^2/2| ≤ 1, because σni^2 ≤ ε^2 + E[|Xni|^2; |Xni| > ε] < 1/t^2 if we take ε
small enough and n large enough. So

| Π_{i=1}^n ϕni(t) − Π_{i=1}^n (1 − t^2 σni^2/2) | ≤ cε|t|^3 Σ E[|Xni|^2] + ct^2 Σ E[|Xni|^2; |Xni| ≥ ε].

Since sup_i σni^2 → 0, log(1 − t^2 σni^2/2) is asymptotically equal to −t^2 σni^2/2, and so

Π (1 − t^2 σni^2/2) = exp( Σ log(1 − t^2 σni^2/2) )

is asymptotically equal to

exp( −t^2 Σ σni^2/2 ) = e^{−t^2 σ^2/2}.

Since ε is arbitrary, the proof is complete.


We now complete the proof of Theorem 8.2.

Proof of “only if” part of Theorem 8.2. Since Σ Xn converges, Xn must converge to zero a.s.,
and so P(|Xn| > A i.o.) = 0. By the Borel-Cantelli lemma, this says Σ P(|Xn| > A) < ∞. We also conclude
by Proposition 5.4 that Σ Yn converges.



Let cn = Σ_{i=1}^n Var Yi and suppose cn → ∞. Let Znm = (Ym − EYm)/√cn. Then Σ_{m=1}^n Var Znm =
(1/cn) Σ_{m=1}^n Var Ym = 1. If ε > 0, then for n large we have 2A/√cn < ε. Since |Ym| ≤ A and hence
|EYm| ≤ A, then |Znm| ≤ 2A/√cn < ε. It follows that Σ_{m=1}^n E[|Znm|^2; |Znm| > ε] = 0 for large n.
By Theorem 20.4, Σ_{m=1}^n (Ym − EYm)/√cn converges weakly to a standard normal. However, Σ_{m=1}^n Ym
converges, and cn → ∞, so Σ_{m=1}^n Ym/√cn must converge to 0. The quantities Σ_{m=1}^n EYm/√cn are
nonrandom, so there is no way the difference can converge to a standard normal, a contradiction. We conclude
cn does not converge to infinity.


Let Vi = Yi − EYi. Since |Vi| ≤ 2A, EVi = 0, and Var Vi = Var Yi, which is summable, by the “if”
part of the three series criterion, Σ Vi converges. Since Σ Yi converges, taking the difference shows Σ EYi
converges. This completes the proof.

21. Framework for Markov chains.


Suppose S is a set with some topological structure that we will use as our state space. Think of S
as being R^d or the positive integers, for example. A sequence of random variables X0, X1, . . . is a Markov
chain if

P(Xn+1 ∈ A | X0, . . . , Xn) = P(Xn+1 ∈ A | Xn) (21.1)

for all n and all measurable sets A. The definition of Markov chain has this intuition: to predict the
probability that Xn+1 is in any set, we only need to know where we currently are; how we got there gives
no additional information.


Let’s make some additional comments. First of all, we previously considered random variables as
mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X
from Ω to S, where (X ∈ A) is F measurable for all open sets A. This agrees with the definition of r.v. in
the case S = R.

Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will
confine our attention to the case where either S is finite, in which case we will usually suppose S = {1, 2, . . . , n},
or countable and discrete, in which case we will usually suppose S is the set of positive integers.


We are going to further restrict our attention to Markov chains where

P(Xn+1 ∈ A | Xn = x) = P(X1 ∈ A | X0 = x),

that is, where the probabilities do not depend on n. Such Markov chains are said to have stationary transition
probabilities.

Define the initial distribution of a Markov chain with stationary transition probabilities by µ(i) =
P(X0 = i). Define the transition probabilities by p(i, j) = P(Xn+1 = j | Xn = i). Since the transition
probabilities are stationary, p(i, j) does not depend on n.

In this case we can use the definition of conditional probability given in undergraduate classes. If
P(Xn = i) = 0 for all n, that means we never visit i and we could drop the point i from the state space.


Proposition 21.1. Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j).
Then

P(Xn = in, Xn−1 = in−1, . . . , X1 = i1, X0 = i0) = µ(i0) p(i0, i1) · · · p(in−1, in). (21.2)

Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n;
we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then

P(X3 = i3, X2 = i2, X1 = i1, X0 = i0)
= E[P(X3 = i3 | X0 = i0, X1 = i1, X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= E[P(X3 = i3 | X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= p(i2, i3) P(X2 = i2, X1 = i1, X0 = i0).

Now by the induction hypothesis,

P(X2 = i2, X1 = i1, X0 = i0) = p(i1, i2) p(i0, i1) µ(i0).

Substituting establishes the claim for n = 3.
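Formula (21.2) can be verified by brute force for a small chain (an added illustration with an arbitrary two-state example): summing µ(i0)p(i0, i1)p(i1, i2) over all paths ending at i2 = j must reproduce P(X2 = j), i.e., the vector µP^2.

```python
# A two-state chain with an arbitrary initial distribution and transition matrix.
mu = [0.3, 0.7]
p = [[0.9, 0.1],
     [0.4, 0.6]]

# P(X2 = j) via the path formula (21.2)
path = [0.0, 0.0]
for i0 in (0, 1):
    for i1 in (0, 1):
        for i2 in (0, 1):
            path[i2] += mu[i0] * p[i0][i1] * p[i1][i2]

# P(X2 = j) via two matrix-vector multiplications: mu P^2
muP = [sum(mu[i] * p[i][j] for i in (0, 1)) for j in (0, 1)]
muP2 = [sum(muP[i] * p[i][j] for i in (0, 1)) for j in (0, 1)]
print(path, muP2)
```

The two computations agree exactly, and the path probabilities sum to 1.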



Proposition 21.2. Suppose µ(i) is a sequence of nonnegative numbers with Σi µ(i) = 1, and for each i
the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial
distribution and p(i, j) as the transition probabilities.


Proof. Define Ω = S^∞. Let F be the σ-field generated by the collection of sets {(i0, i1, . . . , in) : n >
0, ij ∈ S}. An element ω of Ω is a sequence (i0, i1, . . .). Define Xj(ω) = ij if ω = (i0, i1, . . .). Define
P(X0 = i0, . . . , Xn = in) by (21.2). Using the Kolmogorov extension theorem, one can show that P can be
extended to a probability on Ω.

The above framework is rather abstract, but it is clear that under P the sequence Xn has initial
distribution µ(i); what we need to show is that Xn is a Markov chain and that

P(Xn+1 = in+1 | X0 = i0, . . . , Xn = in) = P(Xn+1 = in+1 | Xn = in) = p(in, in+1). (21.3)

By the definition of conditional probability, the left hand side of (21.3) is

P(Xn+1 = in+1 | X0 = i0, . . . , Xn = in)
= P(Xn+1 = in+1, Xn = in, . . . , X0 = i0) / P(Xn = in, . . . , X0 = i0) (21.4)
= [µ(i0) · · · p(in−1, in) p(in, in+1)] / [µ(i0) · · · p(in−1, in)]
= p(in, in+1),

as desired.

To complete the proof we need to show

P(Xn+1 = in+1, Xn = in) / P(Xn = in) = p(in, in+1),

or

P(Xn+1 = in+1, Xn = in) = p(in, in+1) P(Xn = in). (21.5)

Now

P(Xn = in) = Σ_{i0,...,in−1} P(Xn = in, Xn−1 = in−1, . . . , X0 = i0) = Σ_{i0,...,in−1} µ(i0) · · · p(in−1, in),

and similarly

P(Xn+1 = in+1, Xn = in) = Σ_{i0,...,in−1} P(Xn+1 = in+1, Xn = in, Xn−1 = in−1, . . . , X0 = i0)
= p(in, in+1) Σ_{i0,...,in−1} µ(i0) · · · p(in−1, in).

Equation (21.5) now follows.


Note in this construction that the Xn sequence is fixed and does not depend on µ or p. Let p(i, j) be
fixed. The probability we constructed above is often denoted Pµ. If µ is point mass at a point i or x, it is
denoted Pi or Px. So we have one probability space, one sequence Xn, but a whole family of probabilities
Px, one for each starting point x.

Later on we will see that this framework allows one to express the Markov property and strong
Markov property in a convenient way. As part of the preparation for doing this, we define the shift operators
θk : Ω → Ω by

θk(i0, i1, . . .) = (ik, ik+1, . . .).

Then Xj ◦ θk = Xj+k. To see this, if ω = (i0, i1, . . .), then

Xj ◦ θk(ω) = Xj(ik, ik+1, . . .) = ij+k = Xj+k(ω).


22. Examples.

Random walk on the integers

We let Yi be an i.i.d. sequence of r.v.’s, with p = P(Yi = 1) and 1 − p = P(Yi = −1). Let
Xn = X0 + Σ_{i=1}^n Yi. Then the Xn can be viewed as a Markov chain with p(i, i + 1) = p, p(i, i − 1) = 1 − p,
and p(i, j) = 0 if |j − i| ≠ 1. More general random walks on the integers also fit into this framework. To
check that the random walk is Markov,

P(Xn+1 = in+1 | X0 = i0, . . . , Xn = in) = P(Xn+1 − Xn = in+1 − in | X0 = i0, . . . , Xn = in)
= P(Xn+1 − Xn = in+1 − in),

using the independence, while

P(Xn+1 = in+1 | Xn = in) = P(Xn+1 − Xn = in+1 − in | Xn = in) = P(Xn+1 − Xn = in+1 − in).


Random walks on graphs

Suppose we have n points, and from each point there is some probability of going to another point.
For example, suppose there are 5 points and we have p(1,2) = 1/2, p(1,3) = 1/2, p(2,1) = 1/4, p(2,3) = 1/2,
p(2,5) = 1/4, p(3,1) = 1/4, p(3,2) = 1/4, p(3,3) = 1/2, p(4,1) = 1, p(5,1) = 1/2, p(5,5) = 1/2. The p(i, j) are often
arranged into a matrix:

        ⎛  0    1/2   1/2    0     0  ⎞
        ⎜ 1/4    0    1/2    0    1/4 ⎟
    P = ⎜ 1/4   1/4   1/2    0     0  ⎟
        ⎜  1     0     0     0     0  ⎟
        ⎝ 1/2    0     0     0    1/2 ⎠


Note the rows must sum to 1, since

Σ_{j=1}^5 p(i, j) = Σ_{j=1}^5 P(X1 = j | X0 = i) = P(X1 ∈ S | X0 = i) = 1.
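The example above is easy to check and simulate in code (an added illustration; the five states are relabeled 0 through 4).

```python
import random

random.seed(1)
# Transition matrix of the 5-point example, states relabeled 0, ..., 4.
P = [[0,   1/2, 1/2, 0, 0],
     [1/4, 0,   1/2, 0, 1/4],
     [1/4, 1/4, 1/2, 0, 0],
     [1,   0,   0,   0, 0],
     [1/2, 0,   0,   0, 1/2]]

row_sums = [sum(row) for row in P]  # each row must sum to 1
print(row_sums)

# Simulate ten steps of the chain started at state 0,
# sampling each step from the current row of P.
x, visited = 0, [0]
for _ in range(10):
    x = random.choices(range(5), weights=P[x])[0]
    visited.append(x)
print(visited)
```

Every simulated transition uses only entries of P that are strictly positive.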


Renewal processes

Let Yi be i.i.d. with P(Yi = k) = ak, where the ak are nonnegative and sum to 1. Let T0 = i0 and
Tn = T0 + Σ_{i=1}^n Yi. We think of Yi as the lifetime of the nth light bulb and Tn as the time when the nth
light bulb burns out. Let Xn = min{Tm − n : Tm ≥ n}.

So Xn is the amount of time after time n until the current light bulb burns out.

If Xn = j and j > 0, then Ti = n + j for some i but Ti does not equal n, n + 1, . . . , n + j − 1 for any
i. So Ti = (n + 1) + (j − 1) for some i and Ti does not equal (n + 1), (n + 1) + 1, . . . , (n + 1) + (j − 2) for
any i. Therefore Xn+1 = j − 1. So p(i, i − 1) = 1 if i ≥ 1.

If Xn = 0, then a light bulb burned out at time n, and Xn+1 is 0 if the next light bulb burns out
immediately and j − 1 if the next light bulb has lifetime j. The probability of this is aj. So p(0, j) = aj+1. All
the other p(i, j)’s are 0.


Branching processes

Consider k particles. At the next time interval, some of them die, and some of them split into several
particles. The probability that a given particle will split into j particles is given by aj, j = 0, 1, . . ., where
the aj are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all
the other particles. If Xn is the number of particles at time n, then Xn is a Markov chain. Let Yi be i.i.d.
random variables with P(Yi = j) = aj. The p(i, j) for Xn are somewhat complicated, and can be defined by
p(i, j) = P(Σ_{m=1}^i Ym = j).
Queues

We will discuss briefly the M/G/1 queue. The M refers to the fact that the customers arrive according
to a Poisson process, so the probability that the number of customers arriving in a time interval of length t
is k is given by e^{−λt}(λt)^k/k!. The G refers to the fact that the length of time it takes to serve a customer is
given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is 1 server.

Suppose the length of time to serve one customer has distribution function F with density f. The
probability that k customers arrive during the time it takes to serve one customer is

ak = ∫_0^∞ e^{−λt} ((λt)^k/k!) f(t) dt.

Let the Yi be i.i.d. with P(Yi = k − 1) = ak. So Yi + 1 is the number of customers arriving during the time
it takes to serve one customer. Let Xn+1 = (Xn + Yn+1)^+ be the number of customers waiting. Then Xn is
a Markov chain with p(0, 0) = a0 + a1, p(0, k − 1) = ak for k ≥ 2, and p(i, i + k − 1) = ak for i ≥ 1, k ≥ 0.
Ehrenfest urns

Suppose we have two urns with a total of r balls, k in one and r − k in the other. Pick one of the
r balls at random and move it to the other urn. Let Xn be the number of balls in the first urn. Xn is a
Markov chain with p(k, k + 1) = (r − k)/r, p(k, k − 1) = k/r, and p(i, j) = 0 otherwise.

One model for this is to consider two containers of air with a thin tube connecting them. Suppose
a few molecules of a foreign substance are introduced. Then the number of molecules in the first container
is like an Ehrenfest urn. We shall see that all states in this model are recurrent, so infinitely often all the
molecules of the foreign substance will be in the first urn. Yet there is a tendency toward equilibrium, so
on average there will be about the same number of molecules in each container for all large times.


Birth and death processes

Suppose there are i particles, and the probability of a birth is ai, the probability of a death is bi,
where ai, bi ≥ 0 and ai + bi ≤ 1. Setting Xn equal to the number of particles, Xn is a Markov chain with
p(i, i + 1) = ai, p(i, i − 1) = bi, and p(i, i) = 1 − ai − bi.


23. Markov properties.

A special case of the Markov property says that

Ex[f(Xn+1) | Fn] = EXn[f(X1)].

The right hand side is to be interpreted as ϕ(Xn), where ϕ(y) = Ey f(X1). The randomness on the right
hand side all comes from the Xn. If we write f(Xn+1) = f(X1) ◦ θn and we write Y for f(X1), then the
above can be rewritten

Ex[Y ◦ θn | Fn] = EXn Y.

Let F∞ be the σ-field generated by ∪_{n=1}^∞ Fn.


Theorem 23.1. (Markov property) If Y is bounded and measurable with respect to F∞, then

Ex[Y ◦ θn | Fn] = EXn[Y], P-a.s.

for each n and x.


Proof. If we can prove this for Y = f1(X1) · · · fm(Xm), then taking fj(x) = 1_{ij}(x), we will have it for Y’s
of the form 1_{(X1=i1,...,Xm=im)}. By linearity (and the fact that S is countable), we will then have it for Y’s
of the form 1_{((X1,...,Xm)∈B)}. A monotone class argument shows that such Y’s generate F∞.

We use induction on m, and first we prove it for m = 1. We need to show

E[f1(X1) ◦ θn | Fn] = EXn f1(X1).

Using linearity and the fact that S is countable, it suffices to show this for f1(y) = 1_{j}(y). Using the
definition of θn, we need to show

P(Xn+1 = j | Fn) = PXn(X1 = j),

or equivalently,

Px(Xn+1 = j; A) = Ex[PXn(X1 = j); A] (23.2)

when A ∈ Fn. By linearity it suffices to consider A of the form A = (X1 = i1, . . . , Xn = in). The left hand
side of (23.2) is then

Px(Xn+1 = j, X1 = i1, . . . , Xn = in),

and by (21.4) this is equal to

p(in, j) Px(X1 = i1, . . . , Xn = in) = p(in, j) Px(A).


Let g(y) = Py(X1 = j). We have

Px(X1 = j, X0 = k) = Pk(X1 = j) if x = k, and 0 if x ≠ k,

while

Ex[g(X0); X0 = k] = Ex[g(k); X0 = k] = Pk(X1 = j) Px(X0 = k) = Pk(X1 = j) if x = k, and 0 if x ≠ k.

It follows that

p(i, j) = Px(X1 = j | X0 = i) = Pi(X1 = j).

So the right hand side of (23.2) is

Ex[g(Xn); A] = g(in) Px(A) = p(in, j) Px(A),

as required.


Suppose the result holds for m and we want to show it holds for m + 1. We have

Ex[f1(Xn+1) · · · fm+1(Xn+m+1) | Fn]
= Ex[ Ex[fm+1(Xn+m+1) | Fn+m] f1(Xn+1) · · · fm(Xn+m) | Fn ]
= Ex[ EXn+m[fm+1(X1)] f1(Xn+1) · · · fm(Xn+m) | Fn ]
= Ex[ f1(Xn+1) · · · fm−1(Xn+m−1) h(Xn+m) | Fn ].

Here we used the result for m = 1 and we defined h(y) = fm(y) g(y), where g(y) = Ey[fm+1(X1)]. Using
the induction hypothesis, this is equal to

EXn[f1(X1) · · · fm−1(Xm−1) h(Xm)] = EXn[f1(X1) · · · fm(Xm) EXm fm+1(X1)]
= EXn[f1(X1) · · · fm(Xm) E[fm+1(Xm+1) | Fm]]
= EXn[f1(X1) · · · fm+1(Xm+1)],

which is what we needed.


Define θN(ω) = (θ_{N(ω)})(ω). The strong Markov property is the same as the Markov property, but
where the fixed time n is replaced by a stopping time N.

Theorem 23.2. If Y is bounded and measurable and N is a finite stopping time, then

Ex[Y ◦ θN | FN] = EXN[Y].

Proof. We will show

Px(XN+1 = j | FN) = PXN(X1 = j).

Once we have this, we can proceed as in the proof of Theorem 23.1 to obtain our result. To show the
above equality, we need to show that if B ∈ FN, then

Px(XN+1 = j, B) = Ex[PXN(X1 = j); B]. (23.3)

Recall that since B ∈ FN, then B ∩ (N = k) ∈ Fk. We have

Px(XN+1 = j, B, N = k) = Px(Xk+1 = j, B, N = k)
= Ex[Px(Xk+1 = j | Fk); B, N = k]
= Ex[PXk(X1 = j); B, N = k]
= Ex[PXN(X1 = j); B, N = k].

Now sum over k; since N is finite, we obtain our desired result.


Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let
p^n(i, j) = P(Xn = j | X0 = i).

Proposition 23.3. For all i, j, m, n we have

p^{n+m}(i, j) = Σ_{k∈S} p^n(i, k) p^m(k, j).

Proof. We write

P(Xn+m = j, X0 = i) = Σk P(Xn+m = j, Xn = k, X0 = i)
= Σk P(Xn+m = j | Xn = k, X0 = i) P(Xn = k | X0 = i) P(X0 = i)
= Σk P(Xn+m = j | Xn = k) p^n(i, k) P(X0 = i)
= Σk p^m(k, j) p^n(i, k) P(X0 = i).

If we divide both sides by P(X0 = i), we have our result.

Note the resemblance to matrix multiplication. It is clear that if P is the matrix made up of the p(i, j),
then P^n will be the matrix whose (i, j) entry is p^n(i, j).
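In code (an added check with an arbitrary two-state matrix), the Chapman-Kolmogorov equations say that powers of P computed by different groupings agree, e.g. P^2 · P^3 = P^3 · P^2 = P^5.

```python
def matmul(A, B):
    # (i, j) entry of AB is sum_k A[i][k] B[k][j],
    # exactly the Chapman-Kolmogorov sum over intermediate states k
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.9, 0.1],
     [0.4, 0.6]]
P2 = matmul(P, P)
P3 = matmul(P2, P)

# p^{2+3}(i,j) = sum_k p^2(i,k) p^3(k,j) = sum_k p^3(i,k) p^2(k,j)
A = matmul(P2, P3)
B = matmul(P3, P2)
print(A)
print(B)
```

Both orderings give the same stochastic matrix P^5, with rows summing to 1.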



24. Recurrence and transience.

Let

Ty = min{i > 0 : Xi = y}.

This is the first time that Xi hits the point y. Even if X0 = y we would have Ty > 0. We let Ty^k be the k-th
time that the Markov chain hits y, and we set

r(x, y) = Px(Ty < ∞),

the probability starting at x that the Markov chain ever hits y.
Proposition 24.1. Px(Ty^k < ∞) = r(x, y) r(y, y)^{k−1}.

Proof. The case k = 1 is just the definition, so suppose k > 1. Using the strong Markov property,

Px(Ty^k < ∞) = Px(Ty ◦ θ_{Ty^{k−1}} < ∞, Ty^{k−1} < ∞)
= Ex[Px(Ty ◦ θ_{Ty^{k−1}} < ∞ | F_{Ty^{k−1}}); Ty^{k−1} < ∞]
= Ex[P^{X(Ty^{k−1})}(Ty < ∞); Ty^{k−1} < ∞]
= Ex[Py(Ty < ∞); Ty^{k−1} < ∞]
= r(y, y) Px(Ty^{k−1} < ∞).

We used here the fact that at time Ty^{k−1} the Markov chain must be at the point y. Repeating this argument
k − 2 times yields the result.


We say that y is recurrent if r(y, y) = 1; otherwise we say y is transient. Let

N(y) = Σ_{n=1}^∞ 1_{(Xn=y)}.

Proposition 24.2. y is recurrent if and only if EyN(y) = ∞.

Proof. Note

EyN(y) = Σ_{k=1}^∞ Py(N(y) ≥ k) = Σ_{k=1}^∞ Py(Ty^k < ∞) = Σ_{k=1}^∞ r(y, y)^k.

We used the fact that N(y) is the number of visits to y, and that the number of visits being at least k is the
same as the time of the k-th visit being finite. Since r(y, y) ≤ 1, the left hand side will be finite if and only
if r(y, y) < 1.


Observe that

EyN(y) = Σn Py(Xn = y) = Σn p^n(y, y).

If we consider simple symmetric random walk on the integers, then p^n(0, 0) is 0 if n is odd and equal
to C(n, n/2) 2^{−n} if n is even. This is because in order to be at 0 after n steps, the walk must have had n/2
positive steps and n/2 negative steps; the probability of this is given by the binomial distribution. Using
Stirling’s approximation, we see that p^n(0, 0) ∼ c/√n for n even, so the sum Σn p^n(0, 0) diverges, and simple
random walk in one dimension is recurrent.


Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but
transient in 3 or more dimensions.


Proposition 24.3. If x is recurrent and r(x, y) > 0, then y is recurrent and r(y, x) = 1.

Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there is a smallest n and y1, . . . , yn−1
such that p(x, y1) p(y1, y2) · · · p(yn−1, y) > 0. Since this is the smallest n, none of the yi can equal x. Then

Px(Tx = ∞) ≥ p(x, y1) · · · p(yn−1, y)(1 − r(y, x)) > 0,

a contradiction to x being recurrent.

Next we show that y is recurrent. Since r(y, x) > 0 and r(x, y) > 0, there exist L and K such that
p^L(y, x) > 0 and p^K(x, y) > 0. Then

p^{L+n+K}(y, y) ≥ p^L(y, x) p^n(x, x) p^K(x, y).

Summing over n,

Σn p^{L+n+K}(y, y) ≥ p^L(y, x) p^K(x, y) Σn p^n(x, x) = ∞.


We say a subset C of S is closed if x ∈ C and r(x, y) > 0 implies y ∈ C. A subset D is irreducible if
x, y ∈ D implies r(x, y) > 0.

Proposition 24.4. Let C be finite and closed. Then C contains a recurrent state.

From the preceding proposition, if C is also irreducible, then all states in C will be recurrent.

Proof. If not, for all y we have r(y, y) < 1 and

ExN(y) = Σ_{k=1}^∞ r(x, y) r(y, y)^{k−1} = r(x, y)/(1 − r(y, y)) < ∞.

Since C is finite, Σy ExN(y) < ∞. But that is a contradiction, since

Σy ExN(y) = Σy Σn p^n(x, y) = Σn Σy p^n(x, y) = Σn Px(Xn ∈ C) = Σn 1 = ∞,

where we used the fact that C is closed, so that Px(Xn ∈ C) = 1 for x ∈ C.

Theorem 24.5. Let R = {x : r(x, x) = 1}, the set of recurrent states. Then R = ∪_{i=1}^∞ Ri, where each Ri is
closed and irreducible.

Proof. Say x ∼ y if r(x, y) > 0. Since every state is recurrent, x ∼ x, and if x ∼ y, then y ∼ x. If x ∼ y and
y ∼ z, then p^n(x, y) > 0 and p^m(y, z) > 0 for some n and m. Then p^{n+m}(x, z) > 0, or x ∼ z. Therefore we
have an equivalence relation, and we let the Ri be the equivalence classes.

Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For
the branching process model, suppose p(x, 0) > 0 for all x. Then 0 is recurrent and all the other states are
transient. In the renewal chain, there are two cases. If {k : ak > 0} is unbounded, all states are recurrent.
If K = max{k : ak > 0}, then {0, 1, . . . , K − 1} are recurrent states and the rest are transient.

For the queueing model, let µ = Σ k ak, the expected number of people arriving during one customer’s
service time. We may view this as a branching process by letting all the customers arriving during one
person’s service time be considered the progeny of that customer. It turns out that if µ ≤ 1, 0 is recurrent
and all other states are also. If µ > 1, all states are transient.


25. Stationary measures.

A probability µ is a stationary distribution if

Σx µ(x) p(x, y) = µ(y). (25.1)

In matrix notation this is µP = µ, or µ is the left eigenvector corresponding to the eigenvalue 1. In the case of
a stationary distribution, Pµ(X1 = y) = µ(y), which implies that X1, X2, . . . all have the same distribution.
We can use (25.1) when µ is a measure rather than a probability, in which case it is called a stationary
measure.

If we have a random walk on the integers, µ(x) = 1 for all x serves as a stationary measure. In the
case of an asymmetric random walk: p(i, i + 1) = p, p(i, i − 1) = q = 1 − p and p ≠ q, setting µ(x) = (p/q)^x
also works.


In the Ehrenfest urn model, µ(x) = 2^{−r} C(r, x) works. One way to see this is that µ is the distribution
one gets if one flips r coins and puts a coin in the first urn when the coin is heads. A transition corresponds
to picking a coin at random and turning it over.
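This can be confirmed directly (an added check, with r = 6 chosen arbitrarily): the binomial weights µ(x) = 2^{−r} C(r, x) satisfy µP = µ for the Ehrenfest transition probabilities.

```python
from math import comb

r = 6
mu = [comb(r, k) / 2 ** r for k in range(r + 1)]

def p(i, j):
    # Ehrenfest urn: p(k, k+1) = (r-k)/r, p(k, k-1) = k/r
    if j == i + 1:
        return (r - i) / r
    if j == i - 1:
        return i / r
    return 0.0

muP = [sum(mu[i] * p(i, j) for i in range(r + 1)) for j in range(r + 1)]
print(mu)
print(muP)
```

The vector µP reproduces µ entry by entry.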


Proposition 25.1. Let a be recurrent and let T = Ta. Set

µ(y) = Ea Σ_{n=0}^{T−1} 1_{(Xn=y)}.

Then µ is a stationary measure.

The idea of the proof is that µ(y) is the expected number of visits to y by the sequence X0, . . . , XT−1,
while µP is the expected number of visits to y by X1, . . . , XT. These should be the same because XT =
X0 = a.


Proof. First, let p_n(a, y) = Pa(Xn = y, T > n). So

µ(y) = Σ_{n=0}^∞ Pa(Xn = y, T > n) = Σ_{n=0}^∞ p_n(a, y),

and

Σy µ(y) p(y, z) = Σy Σ_{n=0}^∞ p_n(a, y) p(y, z).


Second, we consider the case z ≠ a. Then

Σy p_n(a, y) p(y, z) = Σy Pa(hit y in n steps without first hitting a and then go to z in one step)
= p_{n+1}(a, z).

So

Σy µ(y) p(y, z) = Σn Σy p_n(a, y) p(y, z) = Σ_{n=0}^∞ p_{n+1}(a, z) = Σ_{n=0}^∞ p_n(a, z) = µ(z),

since p_0(a, z) = 0.


Third, we consider the case z = a. Then

Σy p_n(a, y) p(y, z) = Σy Pa(hit y in n steps without first hitting a and then go to z in one step)
= Pa(T = n + 1).

Recall Pa(T = 0) = 0, and since a is recurrent, T < ∞. So

Σy µ(y) p(y, z) = Σn Σy p_n(a, y) p(y, z) = Σ_{n=0}^∞ Pa(T = n + 1) = Σ_{n=0}^∞ Pa(T = n) = 1.

On the other hand,

Σ_{n=0}^{T−1} 1_{(Xn=a)} = 1_{(X0=a)} = 1,

hence µ(a) = 1. Therefore, whether z ≠ a or z = a, we have µP(z) = µ(z).


Finally, we show µ(y) < ∞. If r(a, y) = 0, then µ(y) = 0. If r(a, y) > 0, then r(y, a) > 0, so we may
choose n so that p^n(y, a) > 0, and then

1 = µ(a) = Σx µ(x) p^n(x, a) ≥ µ(y) p^n(y, a),

which implies µ(y) < ∞.



Proposition 25.2. If the Markov chain is irreducible and all states are recurrent, then the stationary
measure is unique up to a constant multiple.

Proof. Fix a ∈ S. Let µa be the stationary measure constructed above and let ν be any other stationary
measure.


Since ν = νP,

ν(z) = ν(a) p(a, z) + Σ_{y≠a} ν(y) p(y, z)
= ν(a) p(a, z) + Σ_{y≠a} ν(a) p(a, y) p(y, z) + Σ_{x≠a} Σ_{y≠a} ν(x) p(x, y) p(y, z)
= ν(a) Pa(X1 = z) + ν(a) Pa(X1 ≠ a, X2 = z) + Pν(X0 ≠ a, X1 ≠ a, X2 = z).

Continuing,

ν(z) = ν(a) Σ_{m=1}^n Pa(X1 ≠ a, X2 ≠ a, . . . , Xm−1 ≠ a, Xm = z)
+ Pν(X0 ≠ a, X1 ≠ a, . . . , Xn−1 ≠ a, Xn = z)
≥ ν(a) Σ_{m=1}^n Pa(X1 ≠ a, X2 ≠ a, . . . , Xm−1 ≠ a, Xm = z).

Letting n → ∞, we obtain

ν(z) ≥ ν(a) µa(z).

We have

ν(a) = Σx ν(x) p^n(x, a) ≥ ν(a) Σx µa(x) p^n(x, a) = ν(a) µa(a) = ν(a),

since µa(a) = 1 (see the proof of Proposition 25.1). This means that we have equality, and so

ν(x) = ν(a) µa(x)

whenever p^n(x, a) > 0. Since r(x, a) > 0, this happens for some n. Consequently

ν(x)/ν(a) = µa(x).


Proposition 25.3. If a stationary distribution exists, then µ(y) > 0 implies y is recurrent.

Proof. If µ(y) > 0, then

∞ = Σ_{n=1}^∞ µ(y) = Σ_{n=1}^∞ Σ_x µ(x)p_n(x, y) = Σ_x µ(x) Σ_{n=1}^∞ p_n(x, y)
= Σ_x µ(x) Σ_{n=1}^∞ P^x(X_n = y) = Σ_x µ(x)E^x N(y)
= Σ_x µ(x)r(x, y)[1 + r(y, y) + r(y, y)^2 + ···].

Since r(x, y) ≤ 1 and µ is a probability measure, this is less than

Σ_x µ(x)(1 + r(y, y) + ···) ≤ 1 + r(y, y) + ···.

Hence r(y, y) must equal 1.
Proposition 25.4. If the Markov chain is irreducible and has stationary distribution µ, then

µ(x) = 1/(E^x T_x).

Proof. µ(x) > 0 for some x. If y ∈ S, then r(x, y) > 0 and so p_n(x, y) > 0 for some n. Hence

µ(y) = Σ_x µ(x)p_n(x, y) > 0.

Hence by Proposition 25.3, all states are recurrent. By the uniqueness of the stationary measure, µ_x is a constant multiple of µ, i.e., µ_x = cµ. Recall

µ_x(y) = Σ_{n=0}^∞ P^x(X_n = y, T_x > n),

and so

Σ_y µ_x(y) = Σ_y Σ_{n=0}^∞ P^x(X_n = y, T_x > n) = Σ_n Σ_y P^x(X_n = y, T_x > n) = Σ_n P^x(T_x > n) = E^x T_x.

Thus c = E^x T_x. Recalling that µ_x(x) = 1,

µ(x) = µ_x(x)/c = 1/(E^x T_x).
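Proposition 25.4 is easy to test by simulation. The following sketch (again on an example chain of my own choosing) estimates E^0 T_0 by Monte Carlo and compares it with 1/π(0), where π is obtained by power iteration:

```python
import random

# Sketch of mu(x) = 1 / E^x T_x on a small example chain of my choosing.
random.seed(7)
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]

def step(x):
    # sample the next state from row P[x]
    u, acc = random.random(), 0.0
    for y, p in enumerate(P[x]):
        acc += p
        if u < acc:
            return y
    return len(P) - 1

# stationary distribution by power iteration
pi = [1 / 3] * 3
for _ in range(500):
    pi = [sum(pi[x] * P[x][y] for x in range(3)) for y in range(3)]

# Monte Carlo estimate of the expected return time E^0 T_0
trials, total = 20000, 0
for _ in range(trials):
    x, t = step(0), 1
    while x != 0:
        x, t = step(x), t + 1
    total += t
mean_return = total / trials               # should be close to 1 / pi[0]
```

With these particular transition probabilities the product mean_return · pi[0] comes out close to 1, as the proposition predicts.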


We make the following distinction for recurrent states. If E^x T_x < ∞, then x is said to be positive recurrent. If x is recurrent but E^x T_x = ∞, x is null recurrent.

Proposition 25.5. Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.

Proof. To show (a), if x is positive recurrent, then there exists a stationary measure µ_x with µ_x(x) = 1. Then µ(y) = µ_x(y)/(E^x T_x) will be a stationary distribution.

For (b), suppose µ(x) > 0 for some x. We showed this implies µ(y) > 0 for all y. Then 0 < µ(y) = 1/(E^y T_y), which implies E^y T_y < ∞.

We showed that if x is recurrent and r(x, y) > 0, then y is recurrent. So (c) follows.

Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as well, then by (a) and (b) or by (c) all states are positive recurrent or transient, a contradiction, and (d) follows.


26. Convergence.

Our goal is to show that under certain conditions p_n(x, y) → π(y), where π is the stationary distribution. (In the null recurrent case p_n(x, y) → 0.)

Consider a random walk on the set {0, 1}, where with probability one on each step the chain moves to the other state. Then p_n(x, y) = 0 if x ≠ y and n is even. A less trivial case is the simple random walk on the integers. We need to eliminate this periodicity.

Suppose x is recurrent, let I_x = {n ≥ 1 : p_n(x, x) > 0}, and let d_x be the g.c.d. (greatest common divisor) of I_x.
Proposition 26.1. If r(x, y) > 0, then d_y = d_x.

Proof. Since x is recurrent, r(y, x) > 0. Choose K and L such that p_K(x, y), p_L(y, x) > 0. Then

p_{K+L+n}(y, y) ≥ p_L(y, x)p_n(x, x)p_K(x, y),

so taking n = 0, we have p_{K+L}(y, y) > 0, so d_y divides K + L. Then d_y divides K + L + n whenever p_n(x, x) > 0, and hence d_y divides every n in I_x. Hence d_y divides d_x. By symmetry d_x divides d_y.


Proposition 26.2. If d_x = 1, there exists m_0 such that p_m(x, x) > 0 whenever m ≥ m_0.

Proof. First of all, I_x is closed under addition: if m, n ∈ I_x,

p_{m+n}(x, x) ≥ p_m(x, x)p_n(x, x) > 0.

Secondly, if there exists N such that N, N + 1 ∈ I_x, let m_0 = N^2. If m ≥ m_0, then m − N^2 = kN + r for some r < N, and

m = r + N^2 + kN = r(N + 1) + (N − r + k)N ∈ I_x.

Third, pick n_0 ∈ I_x and k > 0 such that n_0 + k ∈ I_x. If k = 1, we are done. Since d_x = 1, there exists n_1 ∈ I_x such that k does not divide n_1. We have n_1 = mk + r for some 0 < r < k. Note (m + 1)(n_0 + k) ∈ I_x and (m + 1)n_0 + n_1 ∈ I_x. The difference between these two numbers is (m + 1)k − n_1 = k − r < k. So now we have two numbers in I_x differing by at most k − 1. Repeating at most k times, we get two numbers in I_x differing by at most 1, and we are done.
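A small concrete illustration (the four-state chain below is my own example, not from the text): state 0 has loops of lengths 2 (0 → 1 → 0) and 3 (0 → 2 → 3 → 0), so d_0 = gcd(I_0) = 1.

```python
from math import gcd

# Sketch: compute I_0 and the period d_0 for a chain of my own design
# with return loops of lengths 2 and 3 at state 0.
adj = {0: [1, 2], 1: [0], 2: [3], 3: [0]}  # arrows with positive probability

I0 = []                                    # I_0 = {n >= 1 : p_n(0,0) > 0}
reach = {0}                                # states reachable in n steps
for n in range(1, 40):
    reach = {y for x in reach for y in adj[x]}
    if 0 in reach:
        I0.append(n)

d = 0
for n in I0:
    d = gcd(d, n)                          # the period d_0 = gcd of I_0
```

Consistent with Proposition 26.2, for this chain every m ≥ 2 already lies in I_0 (in general positivity is only guaranteed from some m_0 on).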


We write d for d_x. A chain is aperiodic if d = 1.

If d > 1, we say x ∼ y if p_{kd}(x, y) > 0 for some k > 0. We divide S into equivalence classes S_1, ..., S_d. Every d steps the chain started in S_i is back in S_i. So we look at p′ = p_d on S_i.


Theorem 26.3. Suppose the chain is irreducible, aperiodic, and has a stationary distribution π. Then p_n(x, y) → π(y) as n → ∞.

Proof. The idea is to take two copies of the chain with different starting distributions, let them run independently until they couple, i.e., hit each other, and then have them move together. So define

q((x_1, y_1), (x_2, y_2)) =
p(x_1, x_2)p(y_1, y_2)   if x_1 ≠ y_1,
p(x_1, x_2)              if x_1 = y_1, x_2 = y_2,
0                        otherwise.

Let Z_n = (X_n, Y_n) and T = min{i : X_i = Y_i}. We have

P(X_n = y) = P(X_n = y, T ≤ n) + P(X_n = y, T > n)
= P(Y_n = y, T ≤ n) + P(X_n = y, T > n),

while

P(Y_n = y) = P(Y_n = y, T ≤ n) + P(Y_n = y, T > n).

Subtracting,

P(X_n = y) − P(Y_n = y) ≤ P(X_n = y, T > n) − P(Y_n = y, T > n) ≤ P(T > n).

Using symmetry,

|P(X_n = y) − P(Y_n = y)| ≤ P(T > n).

Suppose we let Y_0 have distribution π and X_0 = x. Then

|p_n(x, y) − π(y)| ≤ P(T > n).

It remains to show P(T > n) → 0. To do this, consider another chain Z′_n = (X_n, Y_n), where now we take X_n, Y_n independent. Define

r((x_1, y_1), (x_2, y_2)) = p(x_1, x_2)p(y_1, y_2).

The chain under the transition probabilities r is irreducible. To see this, there exist K and L such that p_K(x_1, x_2) > 0 and p_L(y_1, y_2) > 0. If M is large, p_{L+M}(x_2, x_2) > 0 and p_{K+M}(y_2, y_2) > 0 (here aperiodicity is used, via Proposition 26.2). So p_{K+L+M}(x_1, x_2) > 0 and p_{K+L+M}(y_1, y_2) > 0, and hence r_{K+L+M}((x_1, y_1), (x_2, y_2)) > 0.

It is easy to check that π′(a, b) = π(a)π(b) is a stationary distribution for Z′. Hence Z′_n is recurrent, and hence it will hit (x, x); hence the time to hit the diagonal {(y, y) : y ∈ S} is finite. However, the distribution of the time to hit the diagonal is the same as that of T.
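The coupling can be watched in simulation. In the sketch below (the chain is my running example; its stationary distribution π was solved by hand from πP = π for this particular matrix), X starts at x = 0, Y starts from π, and the two copies run independently until they meet at time T, after which they move together. Both P(T > n) and max_y |p_n(x, y) − π(y)| should come out small for moderate n.

```python
import random

# Sketch of the coupling in Theorem 26.3; the chain and pi are my own
# example (pi solved from pi P = pi by hand for this matrix).
random.seed(11)
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]
pi = [28 / 73, 20 / 73, 25 / 73]

def draw(dist):
    # sample an index from a probability vector
    u, acc = random.random(), 0.0
    for y, p in enumerate(dist):
        acc += p
        if u < acc:
            return y
    return len(dist) - 1

n, trials = 30, 5000
not_coupled, hits = 0, [0, 0, 0]
for _ in range(trials):
    x, y = 0, draw(pi)                     # X_0 = 0; Y_0 distributed as pi
    coupled = x == y
    for _ in range(n):
        x = draw(P[x])
        y = x if coupled else draw(P[y])   # move together once coupled
        coupled = coupled or x == y
    hits[x] += 1
    not_coupled += 0 if coupled else 1

p_T_gt_n = not_coupled / trials            # estimates P(T > n)
tv = max(abs(hits[y] / trials - pi[y]) for y in range(3))
```

This matches the inequality |p_n(x, y) − π(y)| ≤ P(T > n) derived above.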


27. Gaussian sequences.

We first prove a converse to Proposition 17.3.

Proposition 27.1. If E e^{i(uX+vY)} = E e^{iuX} E e^{ivY} for all u and v, then X and Y are independent random variables.

Proof. Let X′ be a random variable with the same law as X, Y′ one with the same law as Y, with X′ and Y′ independent. (We let Ω = [0, 1]^2, P Lebesgue measure, X′ a function of the first variable, and Y′ a function of the second variable defined as in Proposition 1.2.) Then E e^{i(uX′+vY′)} = E e^{iuX′} E e^{ivY′}. Since X and X′ have the same law, they have the same characteristic function, and similarly for Y and Y′. Therefore (X′, Y′) has the same joint characteristic function as (X, Y). By the uniqueness of the Fourier transform, (X′, Y′) has the same joint law as (X, Y), which is easily seen to imply that X and Y are independent.


A sequence of random variables X_1, ..., X_n is said to be jointly normal if there exist a sequence of independent standard normal random variables Z_1, ..., Z_m and constants b_{ij} and a_i such that X_i = Σ_{j=1}^m b_{ij} Z_j + a_i, i = 1, ..., n. In matrix notation, X = BZ + A. For simplicity, in what follows let us take A = 0; the modifications for the general case are easy. The covariance of two random variables X and Y is defined to be E[(X − EX)(Y − EY)]. Since we are assuming our normal random variables have mean 0, we can omit the centering at expectations. Given a sequence of mean 0 random variables, we can talk about the covariance matrix, which is Cov(X) = E XX^t, where X^t denotes the transpose of the vector X. In the above case, we see Cov(X) = E[(BZ)(BZ)^t] = E[BZZ^tB^t] = BB^t, since E ZZ^t = I, the identity.

Let us compute the joint characteristic function E e^{iu^tX} of the vector X, where u is an n-dimensional vector. First, if v is an m-dimensional vector,

E e^{iv^tZ} = E Π_{j=1}^m e^{iv_j Z_j} = Π_{j=1}^m E e^{iv_j Z_j} = Π_{j=1}^m e^{−v_j^2/2} = e^{−v^t v/2},

using the independence of the Z's. So

E e^{iu^tX} = E e^{iu^tBZ} = e^{−u^t BB^t u/2}.
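The identity Cov(X) = BB^t for X = BZ is easy to confirm empirically. A minimal sketch, with a 2×2 matrix B of my own arbitrary choosing:

```python
import random

# Sketch: empirical check of Cov(X) = B B^t for X = BZ, Z standard
# normal; the matrix B is an arbitrary choice of mine.
random.seed(3)
B = [[1.0, 0.0],
     [0.5, 2.0]]
N = 100000

cov = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(N):
    Z = [random.gauss(0.0, 1.0) for _ in range(2)]
    X = [sum(B[i][j] * Z[j] for j in range(2)) for i in range(2)]
    for i in range(2):
        for j in range(2):
            cov[i][j] += X[i] * X[j] / N   # X has mean 0, so no centering

BBt = [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
```

The empirical matrix cov agrees with BB^t up to Monte Carlo error.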
Proposition 27.2. If the X_i are jointly normal and Cov(X_i, X_j) = 0 for i ≠ j, then the X_i are independent.

Proof. If Cov(X) = BB^t is a diagonal matrix, then the joint characteristic function of the X's factors, and so by Proposition 27.1, the X's are in this case independent.


28. Stationary processes.

In this section we give some preliminaries which will be used in the next section, on the ergodic theorem. We say a sequence X_i is stationary if (X_k, X_{k+1}, ...) has the same distribution as (X_0, X_1, ...).

One example is if the X_i are i.i.d. For readers who are familiar with Markov chains, another is if X_i is a Markov chain, π is the stationary distribution, and X_0 has distribution π.

A third example is rotations of a circle. Let Ω be the unit circle, P normalized Lebesgue measure on Ω, and θ ∈ [0, 2π). We let X_0(ω) = ω and set X_n(ω) = ω + nθ (mod 2π).

A fourth example is the Bernoulli shift: let Ω = [0, 1), P Lebesgue measure, X_0(ω) = ω, and X_n(ω) the binary expansion of ω from the nth place on.


Proposition 28.1. If X_n is stationary, then Y_k = g(X_k, X_{k+1}, ...) is stationary.

Proof. If B ⊂ R^∞, let

A = {x = (x_0, x_1, ...) : (g(x_0, x_1, ...), g(x_1, x_2, ...), ...) ∈ B}.

Then

P((Y_0, Y_1, ...) ∈ B) = P((X_0, X_1, ...) ∈ A) = P((X_k, X_{k+1}, ...) ∈ A) = P((Y_k, Y_{k+1}, ...) ∈ B).


We say that T : Ω → Ω is measure preserving if P(T^{−1}A) = P(A) for all A ∈ F.

There is a one-to-one correspondence between measure preserving transformations and stationary sequences. Given T, let X_0 = ω and X_n = T^n ω. Then

P((X_k, X_{k+1}, ...) ∈ A) = P(T^k(X_0, X_1, ...) ∈ A) = P((X_0, X_1, ...) ∈ A).

On the other hand, if X_k is stationary, define Ω̂ = R^∞, and define X̂_k(ω) = ω_k, where ω = (ω_0, ω_1, ...). Define P̂ on Ω̂ so that the law of X̂ under P̂ is the same as the law of X under P. Then define Tω = (ω_1, ω_2, ...). We see that

P̂(A) = P̂((ω_0, ω_1, ...) ∈ A) = P̂((X̂_0, X̂_1, ...) ∈ A)
= P((X_0, X_1, ...) ∈ A) = P((X_1, X_2, ...) ∈ A)
= P̂((X̂_1, X̂_2, ...) ∈ A) = P̂((ω_1, ω_2, ...) ∈ A)
= P̂(Tω ∈ A) = P̂(T^{−1}A).


We say a set A is invariant if T^{−1}A = A (up to a null set, that is, the symmetric difference has probability zero). The invariant σ-field I is the collection of invariant sets. A measure preserving transformation is ergodic if the invariant σ-field is trivial.

In the case of an i.i.d. sequence, A invariant means A = T^{−n}A ∈ σ(X_n, X_{n+1}, ...) for each n. Hence each invariant set is in the tail σ-field, and by the Kolmogorov 0-1 law, T is ergodic.
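Stationarity in the Bernoulli-shift example can be checked numerically. In the sketch below (my own check, with the binary expansion truncated at 40 digits and fresh bits drawn for each sample), the variables Y_n = Σ_m 2^{−(m+1)} X_{n+m} built from i.i.d. fair bits have the same, uniform, distribution for every n:

```python
import random

# Sketch: with X_i i.i.d. fair bits, Y_n = sum_m 2^{-(m+1)} X_{n+m}
# should be uniform on [0,1) for every n. Truncation at 40 binary
# digits and per-sample bit sequences are my simplifications.
random.seed(5)

def sample_Y(n, digits=40):
    X = [random.randint(0, 1) for _ in range(n + digits)]
    return sum(X[n + m] / 2 ** (m + 1) for m in range(digits))

N = 10000
y0 = [sample_Y(0) for _ in range(N)]
y1 = [sample_Y(1) for _ in range(N)]
mean0, mean1 = sum(y0) / N, sum(y1) / N
frac0 = sum(y < 0.5 for y in y0) / N       # both should be about 1/2
frac1 = sum(y < 0.5 for y in y1) / N
```

Both samples have mean near 1/2 and put about half their mass below 1/2, as the uniform distribution requires.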
recall that if f is measurable and bounded, then f is the L^2 limit of Σ_{k=−K}^K c_k e^{ikx}, where the c_k are the Fourier coefficients. So

f(T^n x) = Σ c_k e^{ikx+iknθ} = Σ d_k e^{ikx},

where d_k = c_k e^{iknθ}. If f(T^n x) = f(x) a.e., then c_k = d_k, or c_k e^{iknθ} = c_k. But θ is not a rational multiple of π, so e^{iknθ} ≠ 1 when k ≠ 0, so c_k = 0 for k ≠ 0. Therefore f is constant a.e. If we take f = 1_A, this says that either A is empty or A is the whole space, up to sets of measure zero.


Our last example was the Bernoulli shift. Let X_i be i.i.d. with P(X_i = 1) = P(X_i = 0) = 1/2. Let Y_n = Σ_{m=0}^∞ 2^{−(m+1)} X_{n+m}. So there exists g such that Y_n = g(X_n, X_{n+1}, ...). If A is invariant for the Bernoulli shift,

A = ((Y_n, Y_{n+1}, ...) ∈ B) = ((X_n, X_{n+1}, ...) ∈ C),

where C = {x : (g(x_0, x_1, ...), g(x_1, x_2, ...), ...) ∈ B}. This is true for all n, so A is in the invariant σ-field for the X_i's, which is trivial. Therefore T is ergodic.


29. The ergodic theorem.

The key to the ergodic theorem is the following maximal lemma.

Lemma 29.1. Let X be integrable. Let T be a measure preserving transformation, let X_j(ω) = X(T^j ω), let S_k(ω) = X_0(ω) + ··· + X_{k−1}(ω), and let M_k(ω) = max(0, S_1(ω), ..., S_k(ω)). Then E[X; M_k > 0] ≥ 0.

Proof. If j ≤ k, M_k(Tω) ≥ S_j(Tω), so X(ω) + M_k(Tω) ≥ X(ω) + S_j(Tω) = S_{j+1}(ω), or

X(ω) ≥ S_{j+1}(ω) − M_k(Tω),   j = 1, ..., k.

Since S_1(ω) = X(ω) and M_k(Tω) ≥ 0, then

X(ω) ≥ S_1(ω) − M_k(Tω).

Therefore

E[X(ω); M_k > 0] ≥ ∫_{(M_k>0)} [max(S_1, ..., S_k)(ω) − M_k(Tω)] = ∫_{(M_k>0)} [M_k(ω) − M_k(Tω)].

On the set (M_k = 0) we have M_k(ω) − M_k(Tω) = −M_k(Tω) ≤ 0. Hence

E[X(ω); M_k > 0] ≥ ∫ [M_k(ω) − M_k(Tω)].

Since T is measure preserving, E M_k(ω) − E M_k(Tω) = 0, which completes the proof.


Recall I is the invariant σ-field. The ergodic theorem says the following.

Theorem 29.2. Let T be measure preserving and X integrable. Then

(1/n) Σ_{m=0}^{n−1} X(T^m ω) → E[X | I],

where the convergence takes place almost surely and in L^1.
Proof. By considering X − E[X | I], we may suppose E[X | I] = 0. Let ε > 0 and let D = (lim sup S_n/n > ε); we will show P(D) = 0.

Let δ > 0. Since X is integrable, Σ_n P(|X_n(ω)| > δn) = Σ_n P(|X| > δn) < ∞ (cf. the proof of Proposition 5.1). By Borel–Cantelli, |X_n|/n will eventually be less than δ. Since δ is arbitrary, |X_n|/n → 0 a.s. Since

(S_n/n)(Tω) − (S_n/n)(ω) = X_n(ω)/n − X_0(ω)/n → 0,

then lim sup (S_n/n)(Tω) = lim sup (S_n/n)(ω), and so D ∈ I. Let X*(ω) = (X(ω) − ε)1_D(ω), and define S*_n and M*_n analogously to the definitions of S_n and M_n. On D, lim sup (S_n/n) > ε, hence lim sup (S*_n/n) > 0.

Let F = ∪_n (M*_n > 0). Note ∪_{i=0}^n (M*_i > 0) = (M*_n > 0). Also |X*| ≤ |X| + ε is integrable. By Lemma 29.1, E[X*; M*_n > 0] ≥ 0. By dominated convergence, E[X*; F] ≥ 0.

We claim D = F, up to null sets. To see this, if lim sup (S*_n/n) > 0, then ω ∈ ∪_n (M*_n > 0). Hence D ⊂ F. On the other hand, if ω ∈ F, then M*_n > 0 for some n, so X*_n ≠ 0 for some n. By the definition of X*, for some n, T^n ω ∈ D, and since D is invariant, ω ∈ D a.s.

Recall D ∈ I. Then

0 ≤ E[X*; D] = E[X − ε; D] = E[E[X | I]; D] − εP(D) = −εP(D),

using the fact that E[X | I] = 0. We conclude P(D) = 0 as desired.

Since we have this for every ε, then lim sup S_n/n ≤ 0. By applying the same argument to −X, we obtain lim inf S_n/n ≥ 0, and we have proved the almost sure result. Let us now turn to the L^1 convergence.

Let M > 0, X′_M = X 1_{(|X| ≤ M)}, and X″_M = X − X′_M. By the almost sure result,

(1/n) Σ X′_M(T^m ω) → E[X′_M | I]

almost surely. Both sides are bounded by M, so

E |(1/n) Σ X′_M(T^m ω) − E[X′_M | I]| → 0.   (29.1)

Let ε > 0 and choose M large so that E|X″_M| < ε; this is possible by dominated convergence. We have

E |(1/n) Σ_{m=0}^{n−1} X″_M(T^m ω)| ≤ (1/n) Σ E|X″_M(T^m ω)| = E|X″_M| ≤ ε

and

E |E[X″_M | I]| ≤ E[E[|X″_M| | I]] = E|X″_M| ≤ ε.

So combining with (29.1),

lim sup E |(1/n) Σ X(T^m ω) − E[X | I]| ≤ 2ε.

This shows the L^1 convergence.


What does the ergodic theorem tell us about our examples? In the case of i.i.d. random variables, we see S_n/n → EX almost surely and in L^1, since E[X | I] = EX. Thus this gives another proof of the SLLN.

For rotations of the circle with X(ω) = 1_A(ω) and θ an irrational multiple of π, E[X | I] = EX = P(A), the normalized Lebesgue measure of A. So the ergodic theorem says that (1/n) Σ 1_A(ω + nθ), the average number of times ω + nθ is in A, converges for almost every ω to the normalized Lebesgue measure of A.
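This rotation example is easy to illustrate numerically. A sketch, with θ = 1 (which is not a rational multiple of π) and a starting point and arc A of my own choosing:

```python
import math

# Sketch: Birkhoff averages for the rotation X_n(w) = w + n*theta mod 2*pi.
# theta = 1 (not a rational multiple of pi); omega and the arc A are
# arbitrary choices of mine.
theta, omega = 1.0, 0.3
A = (0.0, math.pi / 2)                     # arc of normalized measure 1/4

n, count, x = 200000, 0, omega
for _ in range(n):
    if A[0] <= x < A[1]:
        count += 1
    x = (x + theta) % (2 * math.pi)

time_average = count / n                   # should approach 1/4
```

The time average along the orbit comes out close to 1/4, the normalized Lebesgue measure of A, as the ergodic theorem predicts.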

