
Probability Theory
Richard F. Bass


These notes are © 1998 by Richard F. Bass. They may be used for personal or classroom purposes,
but not for commercial purposes.


Revised 2001.


1. Basic notions.


A probability or probability measure is a measure whose total mass is one. Because the origins of
probability are in statistics rather than analysis, some of the terminology is different. For example, instead
of denoting a measure space by (X, A, µ), probabilists use (Ω, F, P). So here Ω is a set, F is called a σ-field
(which is the same thing as a σ-algebra), and P is a measure with P(Ω) = 1. Elements of F are called events.
Elements of Ω are denoted ω.


Instead of saying a property occurs almost everywhere, we talk about properties occurring almost
surely, written a.s. Real-valued measurable functions from Ω to R are called random variables and are
usually denoted by X or Y or other capital letters. We often abbreviate “random variable” by r.v.


We let Ac = {ω ∈ Ω : ω ∉ A} (called the complement of A) and B − A = B ∩ Ac.


Integration (in the sense of Lebesgue) is called expectation or expected value, and we write EX for
∫ X dP. The notation E[X; A] is often used for ∫_A X dP.


The random variable 1A is the function that is one if ω ∈ A and zero otherwise. It is called the
indicator of A (the name characteristic function in probability refers to the Fourier transform). Events such
as (ω : X(ω) > a) are almost always abbreviated by (X > a).


Given a random variable X, we can define a probability on R by

PX(A) = P(X ∈ A), A ⊂ R. (1.1)


The probability PX is called the law of X or the distribution of X. We define FX : R → [0, 1] by

FX(x) = PX((−∞, x]) = P(X ≤ x). (1.2)

The function FX is called the distribution function of X.


As an example, let Ω = {H, T}, F all subsets of Ω (there are 4 of them), P(H) = P(T) = 1/2. Let
X(H) = 1 and X(T) = 0. Then PX = (1/2)δ0 + (1/2)δ1, where δx is point mass at x, that is, δx(A) = 1 if x ∈ A
and 0 otherwise. FX(a) = 0 if a < 0, 1/2 if 0 ≤ a < 1, and 1 if a ≥ 1.
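The coin-flip example can be checked mechanically. The following is a small sketch (not part of the text) that writes the law PX and the distribution function FX as Python functions and verifies the three cases of FX.

```python
# Hypothetical sketch of the coin-flip example: Omega = {H, T},
# X(H) = 1, X(T) = 0, so P_X = (1/2) delta_0 + (1/2) delta_1.

def law(A):
    """P_X(A) = (1/2) delta_0(A) + (1/2) delta_1(A) for a set A of reals."""
    return 0.5 * (0 in A) + 0.5 * (1 in A)

def F(a):
    """F_X(a) = P(X <= a): 0 for a < 0, 1/2 for 0 <= a < 1, 1 for a >= 1."""
    return law({x for x in (0, 1) if x <= a})

assert F(-0.5) == 0.0
assert F(0.5) == 0.5
assert F(2.0) == 1.0
```

The names `law` and `F` are illustrative only; the point is that FX is piecewise constant with jumps exactly where PX puts point mass.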


Proposition 1.1. The distribution function FX of a random variable X satisfies:
(a) FX is nondecreasing;
(b) FX is right continuous with left limits;
(c) lim_{x→∞} FX(x) = 1 and lim_{x→−∞} FX(x) = 0.


Proof. We prove the first part of (b) and leave the others to the reader. If xn ↓ x, then (X ≤ xn) ↓ (X ≤ x),
and so P(X ≤ xn) ↓ P(X ≤ x) since P is a measure.

Note that if xn ↑ x, then (X ≤ xn) ↑ (X < x), and so FX(xn) ↑ P(X < x).



Proposition 1.2. Suppose F is a distribution function. There exists a random variable X such that
F = FX.


Proof. Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Define X(ω) = sup{x : F(x) < ω}. It
is routine to check that FX = F.

In the above proof, essentially X = F^{−1}. However F may have jumps or be constant over some
intervals, so some care is needed in defining X.


Certain distributions or laws are very common. We list some of them.


(a) Bernoulli. A random variable is Bernoulli if P(X = 1) = p, P(X = 0) = 1 − p for some p ∈ [0, 1].
(b) Binomial. This is defined by P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where n is a positive integer, 0 ≤ k ≤ n, and p ∈ [0, 1].
(c) Geometric. For p ∈ (0, 1) we set P(X = k) = (1 − p)p^k. Here k is a nonnegative integer.
(d) Poisson. For λ > 0 we set P(X = k) = e^{−λ}λ^k/k!. Again k is a nonnegative integer.
(e) Uniform. For some positive integer n, set P(X = k) = 1/n for 1 ≤ k ≤ n.



If F is absolutely continuous, we call f = F′ the density of F. Some examples of distributions
characterized by densities are the following.


(f) Uniform on [a, b]. Define f(x) = (b − a)^{−1} 1_{[a,b]}(x). This means that if X has a uniform distribution,
then

P(X ∈ A) = ∫_A (1/(b − a)) 1_{[a,b]}(x) dx.

(g) Exponential. For x > 0 let f(x) = λe^{−λx}.


(h) Standard normal. Define f(x) = (1/√(2π)) e^{−x²/2}. So

P(X ∈ A) = (1/√(2π)) ∫_A e^{−x²/2} dx.


(i) N(µ, σ²). We shall see later that a standard normal has mean zero and variance one. If Z is a
standard normal, then a N(µ, σ²) random variable has the same distribution as µ + σZ. It is an
exercise in calculus to check that such a random variable has density

(1/(√(2π)σ)) e^{−(x−µ)²/2σ²}. (1.3)


(j) Cauchy. Here

f(x) = (1/π) · 1/(1 + x²).


We can use the law of a random variable to calculate expectations.

Proposition 1.3. If g is bounded or nonnegative, then

Eg(X) = ∫ g(x) PX(dx).

Proof. If g is the indicator of an event A, this is just the definition of PX. By linearity, the result holds for
simple functions; by monotone convergence it holds for nonnegative g, and by linearity again it holds for
bounded g.


If FX has a density f, then PX(dx) = f(x) dx. So, for example, EX = ∫ x f(x) dx and EX² =
∫ x² f(x) dx. (We need E|X| finite to justify this if X is not necessarily nonnegative.)


We define the mean of a random variable to be its expectation, and the variance of a random variable
is defined by

Var X = E(X − EX)².


For example, it is routine to see that the mean of a standard normal is zero and its variance is one.
Note

Var X = E(X² − 2X·EX + (EX)²) = EX² − (EX)².
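The identity Var X = EX² − (EX)² can be checked on any small empirical distribution; the following sketch (not from the text) uses four equally weighted values.

```python
# Compare the two variance formulas on an empirical distribution
# placing weight 1/4 on each value.
xs = [1.0, 2.0, 2.0, 5.0]
mean = sum(xs) / len(xs)                                   # EX = 2.5
var_def = sum((x - mean) ** 2 for x in xs) / len(xs)       # E(X - EX)^2
var_alt = sum(x * x for x in xs) / len(xs) - mean ** 2     # EX^2 - (EX)^2

assert abs(var_def - var_alt) < 1e-12
```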


Another equality that is useful is the following.

Proposition 1.4. If X ≥ 0 a.s. and p > 0, then

EX^p = ∫_0^∞ p λ^{p−1} P(X > λ) dλ.

The proof will show that this equality is also valid if we replace P(X > λ) by P(X ≥ λ).
Proof. Use Fubini's theorem and write

∫_0^∞ p λ^{p−1} P(X > λ) dλ = E ∫_0^∞ p λ^{p−1} 1_{(λ,∞)}(X) dλ = E ∫_0^X p λ^{p−1} dλ = EX^p.
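The tail formula in Proposition 1.4 can be verified numerically. The sketch below (not from the text) takes X uniform on {1, 2, 3} and p = 2, so EX² = 14/3, and approximates the integral by a midpoint Riemann sum; the integrand vanishes for λ ≥ 3.

```python
# Verify E X^p = integral_0^inf p * lam^{p-1} * P(X > lam) d(lam)
# for X uniform on {1, 2, 3} and p = 2.
values = [1, 2, 3]
p = 2

def tail(lam):                       # P(X > lam)
    return sum(1 for v in values if v > lam) / len(values)

moment = sum(v ** p for v in values) / len(values)   # E X^2 = 14/3

N = 200000
h = 3.0 / N                          # tail(lam) = 0 for lam >= 3
integral = sum(p * ((i + 0.5) * h) ** (p - 1) * tail((i + 0.5) * h) * h
               for i in range(N))

assert abs(moment - 14 / 3) < 1e-12
assert abs(integral - moment) < 1e-3
```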


We need two elementary inequalities.


Proposition 1.5. (Chebyshev's inequality) If X ≥ 0,

P(X ≥ a) ≤ EX/a.


Proof. We write

P(X ≥ a) = E[1_{[a,∞)}(X)] ≤ E[(X/a) 1_{[a,∞)}(X)] ≤ EX/a,

since X/a is at least 1 when X ∈ [a, ∞).


If we apply this to X = (Y − EY)², we obtain

P(|Y − EY| ≥ a) = P((Y − EY)² ≥ a²) ≤ Var Y/a². (1.4)

This special case of Chebyshev's inequality is sometimes itself referred to as Chebyshev's inequality, while
Proposition 1.5 is sometimes called the Markov inequality.
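A Monte Carlo sanity check of Proposition 1.5 (a sketch, not from the text): for X exponential with rate 1 we have EX = 1, so the empirical tail frequency at level a must sit below 1/a.

```python
import random

# Empirical check of P(X >= a) <= EX / a for X ~ Exponential(1).
random.seed(0)
samples = [random.expovariate(1.0) for _ in range(100000)]
a = 3.0
empirical = sum(1 for x in samples if x >= a) / len(samples)

# True value is e^{-3} ~ 0.0498, well below the Markov bound 1/3.
assert empirical <= 1.0 / a
```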



Proposition 1.6. (Jensen's inequality) Suppose g is convex and X and g(X) are both integrable. Then

g(EX) ≤ Eg(X).


Proof. One property of convex functions is that they lie above their tangent lines, and more generally their
support lines. So if x0 ∈ R, we have

g(x) ≥ g(x0) + c(x − x0)

for some constant c. Take x = X(ω) and take expectations to obtain

Eg(X) ≥ g(x0) + c(EX − x0).

Now set x0 equal to EX.


If An is a sequence of sets, define (An i.o.), read “An infinitely often,” by

(An i.o.) = ∩_{n=1}^∞ ∪_{i=n}^∞ Ai.

This set consists of those ω that are in infinitely many of the An.


A simple but very important proposition is the Borel-Cantelli lemma. It has two parts, and we prove
the first part here, leaving the second part to the next section.


Proposition 1.7. (Borel-Cantelli lemma) If Σ_n P(An) < ∞, then P(An i.o.) = 0.



Proof. We have

P(An i.o.) = lim_{n→∞} P(∪_{i=n}^∞ Ai).

However,

P(∪_{i=n}^∞ Ai) ≤ Σ_{i=n}^∞ P(Ai),

which tends to zero as n → ∞.
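The lemma can be illustrated by simulation (a sketch, not from the text): with independent events An of probability 2^{−n}, the probability sum converges, so along every simulated sample path only events with small index ever occur.

```python
import random

# Simulate 1000 sample paths of independent events A_n with P(A_n) = 2^{-n}.
random.seed(1)
max_last = 0            # largest index n for which A_n occurred, over all paths
for _ in range(1000):
    occurred = [n for n in range(1, 60) if random.random() < 2.0 ** -n]
    if occurred:
        max_last = max(max_last, max(occurred))

# P(any A_n with n >= 30 occurs in any of the 1000 paths) <= 1000 * 2^{-29}.
assert max_last < 30
```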


2. Independence.


Let us say two events A and B are independent if P(A ∩ B) = P(A)P(B). The events A1, . . . , An are
independent if

P(A_{i1} ∩ A_{i2} ∩ · · · ∩ A_{ij}) = P(A_{i1})P(A_{i2}) · · · P(A_{ij})

for every subset {i1, . . . , ij} of {1, 2, . . . , n}.


Proposition 2.1. If A and B are independent, then Ac and B are independent.


Proof. We write

P(Ac ∩ B) = P(B) − P(A ∩ B) = P(B) − P(A)P(B) = P(B)(1 − P(A)) = P(B)P(Ac).



We say two σ-fields F and G are independent if A and B are independent whenever A ∈ F and B ∈ G. Two
random variables X and Y are independent if σ(X) and σ(Y) are independent. (The σ-field σ(X) generated
by X is the collection {(X ∈ A) : A a Borel subset of R}.) We define the independence of n σ-fields or n
random variables in the obvious way.


Proposition 2.1 tells us that A and B are independent if the random variables 1A and 1B are
independent, so the definitions above are consistent.


If f and g are Borel functions and X and Y are independent, then f(X) and g(Y) are independent.
This follows because the σ-field generated by f(X) is a sub-σ-field of the one generated by X, and similarly
for g(Y).


Let FX,Y(x, y) = P(X ≤ x, Y ≤ y) denote the joint distribution function of X and Y. (The comma
inside the set means “and.”)


Proposition 2.2. FX,Y(x, y) = FX(x)FY(y) if and only if X and Y are independent.


Proof. If X and Y are independent, then 1_{(−∞,x]}(X) and 1_{(−∞,y]}(Y) are independent by the above
comments. By the definition of independence, this shows FX,Y(x, y) = FX(x)FY(y).


Conversely, if the equality holds, fix y and let My denote the collection of sets A for which P(X ∈
A, Y ≤ y) = P(X ∈ A)P(Y ≤ y). My contains all sets of the form (−∞, x]. It follows by additivity that My
contains all sets of the form (x, z], and then, by additivity again, all sets that are finite unions of such
half-open, half-closed intervals. Note that the collection of finite unions of such intervals, A, is an algebra
generating the Borel σ-field. It is clear that My is a monotone class, so by the monotone class lemma, My
contains the Borel σ-field.


For a fixed set A, let MA denote the collection of sets B for which P(X ∈ A, Y ∈ B) = P(X ∈
A)P(Y ∈ B). Again, MA is a monotone class, and by the preceding paragraph it contains the σ-field generated
by the collection of finite unions of intervals of the form (x, z], hence contains the Borel sets. Therefore X
and Y are independent.


The following is known as the multiplication theorem.


Proposition 2.3. If X, Y, and XY are integrable and X and Y are independent, then EXY = EXEY.
Proof. Consider the random variables in σ(X) (the σ-field generated by X) and σ(Y) for which the
multiplication theorem is true. It holds for indicators by the definition of X and Y being independent.
It holds for simple random variables, that is, linear combinations of indicators, by linearity of both sides.
It holds for nonnegative random variables by monotone convergence. And it holds for integrable random
variables by linearity again.


Let us give an example of independent random variables. Let Ω = Ω1 × Ω2 and let P = P1 × P2, where
(Ωi, Fi, Pi) are probability spaces for i = 1, 2. We use the product σ-field. Then it is clear that F1 and F2
are independent by the definition of P. If X1 is a random variable such that X1(ω1, ω2) depends only on ω1
and X2 depends only on ω2, then X1 and X2 are independent.



This example can be extended to n independent random variables, and in fact, if one has independent
random variables, one can always view them as coming from a product space. We will not use this fact.
Later on, we will talk about countable sequences of independent r.v.s, and the reader may wonder whether
such things can exist. That they can is a consequence of the Kolmogorov extension theorem; see PTA, for
example.


If X1, . . . , Xn are independent, then so are X1 − EX1, . . . , Xn − EXn. Assuming everything is
integrable, we can expand

Var(X1 + · · · + Xn) = E[(X1 − EX1) + · · · + (Xn − EXn)]²,



using the multiplication theorem to show that the expectations of the cross product terms are zero. We have
thus shown


Var (X1+· · ·+Xn) = VarX1+· · ·+ VarXn. (2.1)


We finish up this section by proving the second half of the Borel-Cantelli lemma.

Proposition 2.4. Suppose An is a sequence of independent events. If Σ_n P(An) = ∞, then P(An i.o.) = 1.


Note that here the An are independent, while in the first half of the Borel-Cantelli lemma no such
assumption was necessary.

Proof. Note

P(∪_{i=n}^N Ai) = 1 − P(∩_{i=n}^N Ai^c) = 1 − Π_{i=n}^N P(Ai^c) = 1 − Π_{i=n}^N (1 − P(Ai)).

By the mean value theorem, 1 − x ≤ e^{−x}, so we have that the right hand side is greater than or equal to
1 − exp(−Σ_{i=n}^N P(Ai)). As N → ∞, this tends to 1, so P(∪_{i=n}^∞ Ai) = 1. This holds for all n, which proves
the result.
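A simulation sketch of the second half (not from the text): independent events with P(An) = 1/n have a divergent probability sum, and occurrences keep accumulating — up to time N one expects about log N of them.

```python
import random

# Independent events A_n with P(A_n) = 1/n, n = 1, ..., N.
random.seed(2)
N = 100000
count = sum(1 for n in range(1, N + 1) if random.random() < 1.0 / n)

# E[count] = sum_{n<=N} 1/n ~ ln N ~ 11.5; occurrences do not die out.
assert count >= 3
```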


3. Convergence.


In this section we consider three ways a sequence of r.v.s Xn can converge.

We say Xn converges to X almost surely if (Xn ↛ X) has probability zero. Xn converges to X in
probability if for each ε > 0, P(|Xn − X| > ε) → 0 as n → ∞. Xn converges to X in L^p if E|Xn − X|^p → 0 as
n → ∞.


The following proposition shows some relationships among the types of convergence.

Proposition 3.1. (a) If Xn → X a.s., then Xn → X in probability.
(b) If Xn → X in L^p, then Xn → X in probability.
(c) If Xn → X in probability, there exists a subsequence nj such that X_{nj} converges to X almost surely.


Proof. To prove (a), note Xn − X tends to 0 almost surely, so 1_{(−ε,ε)^c}(Xn − X) also converges to 0 almost
surely. Now apply the dominated convergence theorem.

(b) comes from Chebyshev's inequality:

P(|Xn − X| > ε) = P(|Xn − X|^p > ε^p) ≤ E|Xn − X|^p/ε^p → 0

as n → ∞.


To prove (c), choose nj larger than n_{j−1} such that P(|Xn − X| > 2^{−j}) < 2^{−j} whenever n ≥ nj.
So if we let Ai = (|X_{nj} − X| > 2^{−i} for some j ≥ i), then P(Ai) ≤ 2^{−i+1}. By the Borel-Cantelli lemma
P(Ai i.o.) = 0. This implies X_{nj} → X on the complement of (Ai i.o.).


Let us give some examples to show there need not be any other implications among the three types
of convergence.

Let Ω = [0, 1], F the Borel σ-field, and P Lebesgue measure. Let Xn = e^n 1_{(0,1/n)}. Then clearly Xn
converges to 0 almost surely and in probability, but EXn^p = e^{np}/n → ∞ for any p.


Let Ω be the unit circle, and let P be Lebesgue measure on the circle normalized to have total mass
1. Let tn = Σ_{i=1}^n i^{−1}, and let An = {θ : t_{n−1} ≤ θ < tn} (taken modulo 2π). Let Xn = 1_{An}. Since tn → ∞, any
point on the unit circle will be in infinitely many An, so Xn does not converge almost surely to 0. But
P(An) = 1/(2πn) → 0, so Xn → 0 in probability.



4. Weak law of large numbers.


Suppose Xn is a sequence of independent random variables. Suppose also that they all have the same
distribution, that is, F_{Xn} = F_{X1} for all n. This situation comes up so often it has a name, independent,
identically distributed, which is abbreviated i.i.d.

Define Sn = Σ_{i=1}^n Xi. Sn is called a partial sum process. Sn/n is the average value of the first n of
the Xi's.


Theorem 4.1. (Weak law of large numbers) Suppose the Xi are i.i.d. and EX1² < ∞. Then Sn/n → EX1
in probability.

Proof. Since the Xi are i.i.d., they all have the same expectation, and so ESn = nEX1. Hence E(Sn/n −
EX1)² is the variance of Sn/n. If ε > 0, by Chebyshev's inequality,

P(|Sn/n − EX1| > ε) ≤ Var(Sn/n)/ε² = (Σ_{i=1}^n Var Xi)/(n²ε²) = n Var X1/(n²ε²). (4.1)

Since EX1² < ∞, then Var X1 < ∞, and the result follows by letting n → ∞.
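A Monte Carlo illustration of the theorem (a sketch, not from the text): running averages of i.i.d. Uniform[0,1] variables concentrate near the mean 1/2, and the Chebyshev bound (4.1) with Var X1 = 1/12 makes deviations of size ε = 0.05 at n = 10000 essentially impossible.

```python
import random

# Check how often |S_n/n - 1/2| > eps across independent trials.
random.seed(3)
n, eps, trials = 10000, 0.05, 200
bad = 0
for _ in range(trials):
    s = sum(random.random() for _ in range(n))
    if abs(s / n - 0.5) > eps:
        bad += 1

# Bound (4.1): Var X_1 / (n * eps^2) = (1/12) / 25 ~ 0.0033 per trial;
# the true deviation probability is far smaller still.
assert bad == 0
```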


A nice application of the weak law of large numbers is a proof of the Weierstrass approximation
theorem.


Theorem 4.2. Suppose f is a continuous function on [0, 1] and ε > 0. There exists a polynomial P such
that sup_{x∈[0,1]} |f(x) − P(x)| < ε.


Proof. Let

P(x) = Σ_{k=0}^n f(k/n) (n choose k) x^k (1 − x)^{n−k}.

Clearly P is a polynomial. Since f is continuous, there exists M such that |f(x)| ≤ M for all x and there
exists δ such that |f(x) − f(y)| < ε/2 whenever |x − y| < δ.


Let Xi be i.i.d. Bernoulli r.v.s with parameter x. Then Sn, the partial sum, is a binomial, and hence
P(x) = Ef(Sn/n). The mean of Sn/n is x. We have

|P(x) − f(x)| = |Ef(Sn/n) − f(EX1)| ≤ E|f(Sn/n) − f(EX1)|
≤ M P(|Sn/n − x| > δ) + ε/2.

By (4.1) the first term will be less than

M Var X1/(nδ²) = M x(1 − x)/(nδ²) ≤ M/(4nδ²),

which will be less than ε/2 if n is large enough, uniformly in x.
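The polynomial in the proof is the Bernstein polynomial of f. The sketch below (not from the text) builds it for the hypothetical target f(x) = |x − 1/2| and checks on a grid that the sup-distance shrinks as n grows.

```python
import math

# Bernstein polynomial of degree n for f, evaluated at x.
def bernstein(f, n, x):
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(x - 0.5)
grid = [i / 100 for i in range(101)]
err = lambda n: max(abs(f(x) - bernstein(f, n, x)) for x in grid)

assert err(200) < err(10)       # approximation improves with n
assert err(200) < 0.05
```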



5. Techniques related to almost sure convergence.


Our aim is the strong law of large numbers (SLLN), which says that Sn/n converges to EX1 almost
surely if E|X1| < ∞.



Proposition 5.1. Suppose Xi is an i.i.d. sequence with EXi⁴ < ∞ and let Sn = Σ_{i=1}^n Xi. Then Sn/n →
EX1 a.s.

Proof. By looking at Xi − EXi we may assume that the Xi have mean 0. By Chebyshev,

P(|Sn/n| > ε) ≤ E(Sn/n)⁴/ε⁴ = ESn⁴/(n⁴ε⁴).


If we expand Sn⁴, we will have terms involving Xi⁴, terms involving Xi²Xj², terms involving Xi³Xj, terms
involving Xi²XjXk, and terms involving XiXjXkXl, with i, j, k, l all being different. By the multiplication
theorem and the fact that the Xi have mean 0, the expectations of all the terms will be 0 except for those
of the first two types. So


ESn⁴ = Σ_{i=1}^n EXi⁴ + 3 Σ_{i≠j} EXi² EXj².


By the finiteness assumption, the first term on the right is bounded by c1·n. By Cauchy-Schwarz, EXi² ≤
(EXi⁴)^{1/2} < ∞, and there are at most n² terms in the second sum on the right, so the second term is
bounded by c2·n². Substituting, we have

P(|Sn/n| > ε) ≤ c3/(n²ε⁴).

Consequently P(|Sn/n| > ε i.o.) = 0 by Borel-Cantelli. Since ε is arbitrary, this implies Sn/n → 0 a.s.



Before we can prove the SLLN assuming only the finiteness of first moments, we need some
preliminaries.

Proposition 5.2. If Y ≥ 0, then EY < ∞ if and only if Σ_n P(Y > n) < ∞.

Proof. By Proposition 1.4, EY = ∫_0^∞ P(Y > x) dx. P(Y > x) is nonincreasing in x, so the integral is
bounded above by Σ_{n=0}^∞ P(Y > n) and bounded below by Σ_{n=1}^∞ P(Y > n).
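The sandwich in Proposition 5.2 can be checked exactly for a geometric variable (a sketch, not from the text): with P(Y = k) = (1 − p)p^k on k = 0, 1, 2, . . . , one has EY = p/(1 − p) and P(Y > n) = p^{n+1}, and for integer-valued Y the upper sum equals EY.

```python
# Compare EY with the two tail sums from Proposition 5.2.
p = 0.6
EY = p / (1 - p)                                   # = 1.5
upper = sum(p ** (n + 1) for n in range(0, 500))   # sum_{n>=0} P(Y > n)
lower = sum(p ** (n + 1) for n in range(1, 500))   # sum_{n>=1} P(Y > n)

assert lower <= EY <= upper + 1e-9
assert abs(upper - EY) < 1e-9   # exact equality for integer-valued Y
```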


If Xi is a sequence of r.v.s, the tail σ-field is defined by ∩_{n=1}^∞ σ(Xn, X_{n+1}, . . .). An example of an
event in the tail σ-field is (lim sup_{n→∞} Xn > a). Another example is (lim sup_{n→∞} Sn/n > a). The reason
for this is that if k < n is fixed,

Sn/n = Sk/n + (Σ_{i=k+1}^n Xi)/n.

The first term on the right tends to 0 as n → ∞. So lim sup Sn/n = lim sup (Σ_{i=k+1}^n Xi)/n, which is in
σ(X_{k+1}, X_{k+2}, . . .). This holds for each k. The set (lim sup Sn > a) is easily seen not to be in the tail σ-field.


Theorem 5.3. (Kolmogorov 0-1 law) If the Xi are independent, then the events in the tail σ-field have
probability 0 or 1.

This implies that in the case of i.i.d. random variables, if Sn/n has a limit with positive probability,
then it has a limit with probability one, and the limit must be a constant.

Proof. Let M be the collection of sets in σ(X_{n+1}, . . .) that are independent of every set in σ(X1, . . . , Xn).
M is easily seen to be a monotone class and it contains σ(X_{n+1}, . . . , XN) for every N > n. Therefore M
must be equal to σ(X_{n+1}, . . .).

If A is in the tail σ-field, then for each n, A is independent of σ(X1, . . . , Xn). The class MA of sets
independent of A is a monotone class, hence is a σ-field containing σ(X1, . . . , Xn) for each n. Therefore MA
contains σ(X1, X2, . . .), and in particular contains the tail σ-field, hence A itself.

We thus have that the event A is independent of itself, or

P(A) = P(A ∩ A) = P(A)P(A) = P(A)².

This implies P(A) is zero or one.


The next proposition shows that in considering a law of large numbers we can consider truncated
random variables.

Proposition 5.4. Suppose Xi is an i.i.d. sequence of r.v.s with E|X1| < ∞. Let Xn′ = Xn 1_{(|Xn|≤n)}. Then
(a) Xn converges almost surely if and only if Xn′ does;
(b) if Sn′ = Σ_{i=1}^n Xi′, then Sn/n converges a.s. if and only if Sn′/n does.

Proof. Let An = (Xn ≠ Xn′) = (|Xn| > n). Then P(An) = P(|Xn| > n) = P(|X1| > n). Since E|X1| < ∞,
then by Proposition 5.2 we have Σ P(An) < ∞. So by the Borel-Cantelli lemma, P(An i.o.) = 0. Thus for
almost every ω, Xn = Xn′ for n sufficiently large. This proves (a).

For (b), let k (depending on ω) be the largest integer such that Xk′(ω) ≠ Xk(ω). Then Sn/n − Sn′/n =
(X1 + · · · + Xk)/n − (X1′ + · · · + Xk′)/n → 0 as n → ∞.


Next is Kolmogorov’s inequality, a special case of Doob’s inequality.


Proposition 5.5. Suppose the Xi are independent andEXi = 0for eachi. Then


P( max



1≤i≤n|Si| ≥λ)≤


ESn2


λ2 .


Proof. Let Ak = (|Sk| ≥ λ, |S1| < λ, . . . , |S_{k−1}| < λ). Note the Ak are disjoint and that Ak ∈ σ(X1, . . . , Xk).
Therefore Ak is independent of Sn − Sk. Then

ESn² ≥ Σ_{k=1}^n E[Sn²; Ak]
= Σ_{k=1}^n E[(Sk² + 2Sk(Sn − Sk) + (Sn − Sk)²); Ak]
≥ Σ_{k=1}^n E[Sk²; Ak] + 2 Σ_{k=1}^n E[Sk(Sn − Sk); Ak].


Using the independence, E[Sk(Sn − Sk)1_{Ak}] = E[Sk 1_{Ak}] E[Sn − Sk] = 0. Therefore

ESn² ≥ Σ_{k=1}^n E[Sk²; Ak] ≥ Σ_{k=1}^n λ² P(Ak) = λ² P(max_{1≤k≤n} |Sk| ≥ λ).

Our result is immediate from this.
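A Monte Carlo sanity check of the inequality (a sketch, not from the text): for a walk with independent ±1 steps, ESn² = n, so the running maximum obeys P(max_{i≤n} |Si| ≥ λ) ≤ n/λ².

```python
import random

# Estimate P(max |S_i| >= lam) for a +-1 walk and compare with n/lam^2.
random.seed(4)
n, lam, trials = 100, 25, 5000
hits = 0
for _ in range(trials):
    s, m = 0, 0
    for _ in range(n):
        s += random.choice((-1, 1))
        m = max(m, abs(s))
    if m >= lam:
        hits += 1

assert hits / trials <= n / lam ** 2    # Kolmogorov bound = 100/625 = 0.16
```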


The last result we need for now is a special case of what is known as Kronecker's lemma.

Proposition 5.6. Suppose xi are real numbers and sn = Σ_{i=1}^n xi. If Σ_{j=1}^∞ (xj/j) converges, then sn/n → 0.


Proof. Let bn = Σ_{j=1}^n (xj/j), b0 = 0, and suppose bn → b. As is well known, this implies (Σ_{i=1}^n bi)/n → b.
We have n(bn − b_{n−1}) = xn, so

sn/n = (Σ_{i=1}^n (i bi − i b_{i−1}))/n = (Σ_{i=1}^n i bi − Σ_{i=1}^{n−1} (i + 1) bi)/n = bn − (Σ_{i=1}^{n−1} bi)/n → b − b = 0.
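A deterministic check of Kronecker's lemma (a sketch, not from the text): with xj = cos j, the series Σ xj/j converges by Dirichlet's test, and indeed sn/n → 0 because the partial sums sn stay bounded.

```python
import math

# Partial sums s_n = sum_{j=1}^n cos j; these are bounded uniformly in n.
def s(n):
    return sum(math.cos(j) for j in range(1, n + 1))

# Closed form gives |s_n| <= 1/(2 sin(1/2)) + 1/2 < 2.6 for all n.
assert all(abs(s(n)) < 2.6 for n in (10, 100, 1000))
assert abs(s(20000) / 20000) < 1e-3      # s_n/n -> 0
```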



6. Strong law of large numbers.


This section is devoted to a proof of Kolmogorov's strong law of large numbers. We showed earlier
that if EXi² < ∞, where the Xi are i.i.d., then the weak law of large numbers (WLLN) holds: Sn/n converges
to EX1 in probability. The WLLN can be improved greatly; it is enough that xP(|X1| > x) → 0 as x → ∞.
Here we show the strong law (SLLN): if one has a finite first moment, then there is almost sure convergence.
First we need a lemma.



Lemma 6.1. Suppose Vi is a sequence of independent r.v.s, each with mean 0. Let Wn = Σ_{i=1}^n Vi. If
Σ_{i=1}^∞ Var Vi < ∞, then Wn converges almost surely.


Proof. Choose nj > n_{j−1} such that Σ_{i=nj}^∞ Var Vi < 2^{−3j}. If n > nj, then applying Kolmogorov's inequality
shows that

P(max_{nj≤i≤n} |Wi − W_{nj}| > 2^{−j}) ≤ 2^{−3j}/2^{−2j} = 2^{−j}.

Letting n → ∞, we have P(Aj) ≤ 2^{−j}, where

Aj = (max_{nj≤i} |Wi − W_{nj}| > 2^{−j}).

By the Borel-Cantelli lemma, P(Aj i.o.) = 0.


Suppose ω ∉ (Aj i.o.). Let ε > 0. Choose j large enough so that 2^{−j+1} < ε and ω ∉ Aj. If n, m > nj,
then

|Wn − Wm| ≤ |Wn − W_{nj}| + |Wm − W_{nj}| ≤ 2^{−j+1} < ε.

Since ε is arbitrary, Wn(ω) is a Cauchy sequence, and hence converges.


Theorem 6.2. (SLLN) Let Xi be a sequence of i.i.d. random variables. Then Sn/n converges almost surely
if and only if E|X1| < ∞.

Proof. Let us first suppose Sn/n converges a.s. and show E|X1| < ∞. If Sn(ω)/n → a, then

S_{n−1}/n = (S_{n−1}/(n − 1)) · ((n − 1)/n) → a.

So

Xn/n = Sn/n − S_{n−1}/n → a − a = 0.


Hence Xn/n → 0, a.s. Thus P(|Xn| > n i.o.) = 0. By the second part of Borel-Cantelli, Σ P(|Xn| > n) < ∞.
Since the Xi are i.i.d., this means Σ_{n=1}^∞ P(|X1| > n) < ∞, and by Proposition 5.2, E|X1| < ∞.


Now suppose E|X1| < ∞. By looking at Xi − EXi, we may suppose without loss of generality that
EXi = 0. We truncate, and let Yi = Xi 1_{(|Xi|≤i)}. It suffices to show Σ_{i=1}^n Yi/n → 0 a.s., by Proposition 5.4.

Next we estimate. We have

EYi = E[X1; |X1| ≤ i] → EX1 = 0.


The convergence follows by the dominated convergence theorem, since the integrands are bounded by |X1|.

To estimate the second moment of the Yi, we write

EYi² = ∫_0^∞ 2y P(|Yi| ≥ y) dy = ∫_0^i 2y P(|Yi| ≥ y) dy ≤ ∫_0^i 2y P(|X1| ≥ y) dy,

and so

Σ_{i=1}^∞ E(Yi²/i²) ≤ Σ_{i=1}^∞ (1/i²) ∫_0^i 2y P(|X1| ≥ y) dy
= 2 Σ_{i=1}^∞ (1/i²) ∫_0^∞ 1_{(y≤i)} y P(|X1| ≥ y) dy
= 2 ∫_0^∞ Σ_{i=1}^∞ (1/i²) 1_{(y≤i)} y P(|X1| ≥ y) dy
≤ 4 ∫_0^∞ (1/y) y P(|X1| ≥ y) dy
= 4 ∫_0^∞ P(|X1| ≥ y) dy = 4E|X1| < ∞.


Let Ui = Yi − EYi. Then Var Ui = Var Yi ≤ EYi², and by the above,

Σ_{i=1}^∞ Var(Ui/i) < ∞.

By Lemma 6.1 (with Vi = Ui/i), Σ_{i=1}^n (Ui/i) converges almost surely. By Kronecker's lemma, (Σ_{i=1}^n Ui)/n → 0
almost surely. Finally, since EYi → 0, then Σ_{i=1}^n EYi/n → 0, hence Σ_{i=1}^n Yi/n → 0.
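The dichotomy in Theorem 6.2 can be seen in simulation (a sketch, not from the text): averages of i.i.d. Uniform[0,1] variables settle near EX1 = 1/2, while averages of Cauchy samples, for which E|X1| = ∞, never settle down.

```python
import math
import random

random.seed(5)
n = 200000

# Finite first moment: the average converges.
unif_avg = sum(random.random() for _ in range(n)) / n
assert abs(unif_avg - 0.5) < 0.01

# Cauchy samples via the inverse distribution function tan(pi(U - 1/2)).
s, running = 0.0, []
for i in range(1, n + 1):
    s += math.tan(math.pi * (random.random() - 0.5))
    if i % 50000 == 0:
        running.append(s / i)

# The running averages keep moving instead of converging.
spread = max(running) - min(running)
assert spread > 1e-3
```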


7. Uniform integrability. Before proceeding to some extensions of the SLLN, we discuss uniform
integrability. A sequence of r.v.s is uniformly integrable if

sup_i ∫_{(|Xi|>M)} |Xi| dP → 0

as M → ∞.


Proposition 7.1. Suppose there exists ϕ : [0, ∞) → [0, ∞) such that ϕ is nondecreasing, ϕ(x)/x → ∞ as
x → ∞, and sup_i Eϕ(|Xi|) < ∞. Then the Xi are uniformly integrable.

Proof. Let ε > 0 and choose x0 such that x/ϕ(x) < ε if x ≥ x0. If M ≥ x0,

∫_{(|Xi|>M)} |Xi| = ∫ (|Xi|/ϕ(|Xi|)) ϕ(|Xi|) 1_{(|Xi|>M)} ≤ ε ∫ ϕ(|Xi|) ≤ ε sup_i Eϕ(|Xi|).


Proposition 7.2. If Xn and Yn are two uniformly integrable sequences, then Xn + Yn is also a uniformly
integrable sequence.

Proof. Since there exists M0 such that sup_n E[|Xn|; |Xn| > M0] < 1 and sup_n E[|Yn|; |Yn| > M0] < 1,
then sup_n E|Xn| ≤ M0 + 1, and similarly for the Yn. Let ε > 0 and choose M1 > 4(M0 + 1)/ε such that
sup_n E[|Xn|; |Xn| > M1] < ε/4 and sup_n E[|Yn|; |Yn| > M1] < ε/4. Let M2 = 4M1².

Note P(|Xn| + |Yn| > M2) ≤ (E|Xn| + E|Yn|)/M2 ≤ ε/(4M1) by Chebyshev's inequality. Then

E[|Xn + Yn|; |Xn + Yn| > M2] ≤ E[|Xn|; |Xn| > M1]
+ E[|Xn|; |Xn| ≤ M1, |Xn + Yn| > M2]
+ E[|Yn|; |Yn| > M1]
+ E[|Yn|; |Yn| ≤ M1, |Xn + Yn| > M2].

The first and third terms on the right are each less than ε/4 by our choice of M1. The second and fourth
terms are each at most M1 P(|Xn + Yn| > M2) ≤ ε/4, so the whole right-hand side is less than ε.


The main result we need in this section is Vitali's convergence theorem.

Theorem 7.3. If Xn → X almost surely and the Xn are uniformly integrable, then E|Xn − X| → 0.

Proof. By the above proposition, Xn − X is uniformly integrable and tends to 0 a.s., so without loss of
generality, we may assume X = 0. Let ε > 0 and choose M such that sup_n E[|Xn|; |Xn| > M] < ε. Then

E|Xn| ≤ E[|Xn|; |Xn| > M] + E[|Xn|; |Xn| ≤ M] ≤ ε + E[|Xn| 1_{(|Xn|≤M)}].

The second term on the right goes to 0 by dominated convergence.


8. Complements to the SLLN.


Proposition 8.1. Suppose Xi is an i.i.d. sequence and E|X1| < ∞. Then

E|Sn/n − EX1| → 0.


Proof. Without loss of generality we may assume EX1 = 0. By the SLLN, Sn/n → 0 a.s. So we need to
show that the sequence Sn/n is uniformly integrable.

Pick M1 such that E[|X1|; |X1| > M1] < ε/2. Pick M2 = M1 E|X1|/ε. So

P(|Sn/n| > M2) ≤ E|Sn|/(nM2) ≤ E|X1|/M2 = ε/M1.

We used here E|Sn| ≤ Σ_{i=1}^n E|Xi| = nE|X1|.


We then have

E[|Xi|; |Sn/n| > M2] ≤ E[|Xi|; |Xi| > M1] + E[|Xi|; |Xi| ≤ M1, |Sn/n| > M2]
≤ ε + M1 P(|Sn/n| > M2) ≤ 2ε.

Finally,

E[|Sn/n|; |Sn/n| > M2] ≤ (1/n) Σ_{i=1}^n E[|Xi|; |Sn/n| > M2] ≤ 2ε.



Theorem 8.2. Let Xi be a sequence of independent random variables, A > 0, and Yi = Xi 1_{(|Xi|≤A)}. Then
Σ Xi converges if and only if all of the following three series converge: (a) Σ P(|Xn| > A); (b) Σ EYi; (c)
Σ Var Yi.


Proof of “if” part. Since (c) holds, then Σ (Yi − EYi) converges by Lemma 6.1. Since (b) holds, taking the
difference shows Σ Yi converges. Since (a) holds, Σ P(Xi ≠ Yi) = Σ P(|Xi| > A) < ∞, so by Borel-Cantelli,
P(Xi ≠ Yi i.o.) = 0. It follows that Σ Xi converges.
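A classic illustration of Theorem 8.2 (a sketch, not from the text) is the random harmonic series Xn = εn/n with independent fair signs εn = ±1: with A = 1, series (a) is identically zero, (b) is zero, and (c) is Σ 1/n² < ∞, so the partial sums converge almost surely.

```python
import random

# Partial sums of the random harmonic series sum eps_n / n.
random.seed(6)
s, checkpoints = 0.0, {}
for n in range(1, 100001):
    s += random.choice((-1, 1)) / n
    if n in (10000, 50000, 100000):
        checkpoints[n] = s

# Cauchy behaviour: the tail from 50000 to 100000 has standard
# deviation sqrt(sum 1/n^2) ~ 0.003, so late partial sums are close.
assert abs(checkpoints[100000] - checkpoints[50000]) < 0.1
```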


9. Conditional expectation.

If F ⊆ G are two σ-fields and X is an integrable G measurable random variable, the conditional
expectation of X given F, written E[X | F] and read as “the expectation (or expected value) of X given
F,” is any F measurable random variable Y such that E[Y; A] = E[X; A] for every A ∈ F. The conditional
probability of A ∈ G given F is defined by P(A | F) = E[1A | F].

If Y1, Y2 are two F measurable random variables with E[Y1; A] = E[Y2; A] for all A ∈ F, then Y1 = Y2,
a.s.; that is, conditional expectation is unique up to a.s. equivalence.


In the case X is already F measurable, E[X | F] = X. If X is independent of F, E[X | F] = EX.
Both of these facts follow immediately from the definition. For another example, which ties this definition
with the one used in elementary probability courses, if {Ai} is a finite collection of disjoint sets whose union
is Ω, P(Ai) > 0 for all i, and F is the σ-field generated by the Ai's, then

P(A | F) = Σ_i (P(A ∩ Ai)/P(Ai)) 1_{Ai}.

This follows since the right-hand side is F measurable and its expectation over any set Ai is P(A ∩ Ai).
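The partition formula can be computed exactly on a finite sample space. The following sketch (not from the text) takes two fair dice, lets F be generated by the value of the first die, and conditions the event A = {sum equals 7}; the answer is 1/6 on every atom.

```python
from fractions import Fraction as Fr

# Omega = outcomes of two fair dice, each point having probability 1/36.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]
P = {w: Fr(1, 36) for w in omega}
A = {w for w in omega if w[0] + w[1] == 7}

def cond_prob(w):
    """P(A | F)(w) where F is generated by the first coordinate."""
    cell = [v for v in omega if v[0] == w[0]]    # atom of F containing w
    return sum(P[v] for v in cell if v in A) / sum(P[v] for v in cell)

assert all(cond_prob(w) == Fr(1, 6) for w in omega)
# Taking expectations recovers P(A) = 6/36.
assert sum(cond_prob(w) * P[w] for w in omega) == Fr(1, 6)
```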



As an example, suppose we toss a fair coin independently 5 times and let Xi be 1 or 0 depending
on whether the ith toss was a heads or tails. Let A be the event that there were 5 heads and let Fi =
σ(X1, . . . , Xi). Then P(A) = 1/32, while P(A | F1) is equal to 1/16 on the event (X1 = 1) and 0 on the event
(X1 = 0). P(A | F2) is equal to 1/8 on the event (X1 = 1, X2 = 1) and 0 otherwise.


We have

E[E[X | F]] = EX (9.1)

because E[E[X | F]] = E[E[X | F]; Ω] = E[X; Ω] = EX.
The following is easy to establish.


Proposition 9.1. (a) If X ≥ Y are both integrable, then E[X | F] ≥ E[Y | F] a.s.
(b) If X and Y are integrable and a ∈ R, then E[aX + Y | F] = aE[X | F] + E[Y | F].


It is easy to check that limit theorems such as monotone convergence and dominated convergence
have conditional expectation versions, as do inequalities like Jensen’s and Chebyshev’s inequalities. Thus,
for example, we have the following.


Proposition 9.2. (Jensen's inequality for conditional expectations) If g is convex and X and g(X) are
integrable, then

g(E[X | F]) ≤ E[g(X) | F], a.s.

Proposition 9.3. If X and XY are integrable and Y is measurable with respect to F, then

E[XY | F] = Y E[X | F]. (9.2)

Proof. If A ∈ F, then for any B ∈ F,

E[1A E[X | F]; B] = E[E[X | F]; A ∩ B] = E[X; A ∩ B] = E[1A X; B].

Since 1A E[X | F] is F measurable, this shows that (9.2) holds when Y = 1A and A ∈ F. Using linearity
and taking limits shows that (9.2) holds whenever Y is F measurable and X and XY are integrable.
Two other equalities follow.

Proposition 9.4. If E ⊆ F ⊆ G, then

E[E[X | F] | E] = E[X | E] = E[E[X | E] | F].

Proof. The right equality holds because E[X | E] is E measurable, hence F measurable. To show the left
equality, let A ∈ E. Then since A is also in F,

E[E[E[X | F] | E]; A] = E[E[X | F]; A] = E[X; A] = E[E[X | E]; A].

Since both sides are E measurable, the equality follows.


To show the existence of E[X | F], we proceed as follows.

Proposition 9.5. If X is integrable, then E[X | F] exists.

Proof. Using linearity, we need only consider X ≥ 0. Define a measure Q on F by Q(A) = E[X; A] for
A ∈ F. This is trivially absolutely continuous with respect to P|F, the restriction of P to F. Let E[X | F]
be the Radon-Nikodym derivative of Q with respect to P|F. The Radon-Nikodym derivative is F measurable
by construction and so provides the desired random variable.


When F = σ(Y), one usually writes E[X | Y] for E[X | F]. Notation that is commonly used (however,
we will use it only very occasionally and only for heuristic purposes) is E[X | Y = y]. The definition is as
follows. If A ∈ σ(Y), then A = (Y ∈ B) for some Borel set B by the definition of σ(Y), or 1A = 1B(Y). By
linearity and taking limits, if Z is σ(Y) measurable, Z = f(Y) for some Borel measurable function f. Set
Z = E[X | Y] and choose f Borel measurable so that Z = f(Y). Then E[X | Y = y] is defined to be f(y).

If X ∈ L² and M = {Y ∈ L² : Y is F-measurable}, one can show that E[X | F] is equal to the
projection of X onto the subspace M. We will not use this in these notes.


10. Stopping times.


We next want to talk about stopping times. Suppose we have a sequence of σ-fields Fi such that
Fi ⊂ F_{i+1} for each i. An example would be if Fi = σ(X1, . . . , Xi). A random mapping N from Ω to
{0, 1, 2, . . .} is called a stopping time if for each n, (N ≤ n) ∈ Fn. A stopping time is also called an optional
time in the Markov theory literature.


The intuition is that the sequence knows whether N has happened by time n by looking at Fn.


Suppose some motorists are told to drive north on Highway 99 in Seattle and stop at the first motorcycle
shop past the second realtor after the city limits. So they drive north, pass the city limits, pass two realtors,
and come to the next motorcycle shop, and stop. That is a stopping time. If they are instead told to stop
at the third stop light before the city limits (and they had not been there before), they would need to drive
to the city limits, then turn around and return past three stop lights. That is not a stopping time, because
they have to go ahead of where they wanted to stop to know to stop there.
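In the same spirit, a first-passage time N = min{n : S_n ≥ b} is a stopping time: whether N ≤ n is determined by S_0, . . . , S_n alone. A minimal sketch (the particular path and level are arbitrary choices for illustration):

```python
# A first-passage time is a stopping time: the event (N <= n) is decided
# by the path S_0, ..., S_n alone, with no peeking ahead.

def hitting_time(path, level):
    """First index n with path[n] >= level, or None if never reached."""
    for n, s in enumerate(path):
        if s >= level:
            return n
    return None

path = [0, 1, 0, 1, 2, 3, 2, 3]      # an arbitrary deterministic path
N = hitting_time(path, 3)            # here N = 5

# (N <= n) is computable from the prefix path[:n+1] only:
consistent = all(
    any(s >= 3 for s in path[:n + 1]) == (N is not None and N <= n)
    for n in range(len(path))
)
```

The "stop at the third stop light before the city limits" rule fails exactly this prefix test: deciding it requires values of the path beyond time n.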



Proposition 10.1.
(a) Fixed times n are stopping times.
(b) If N_1 and N_2 are stopping times, then so are N_1 ∧ N_2 and N_1 ∨ N_2.
(c) If N_n is a nondecreasing sequence of stopping times, then so is N = sup_n N_n.
(d) If N_n is a nonincreasing sequence of stopping times, then so is N = inf_n N_n.
(e) If N is a stopping time, then so is N + n.

We define F_N = {A : A ∩ (N ≤ n) ∈ F_n for all n}.


11. Martingales.

In this section we consider martingales. Let F_n be an increasing sequence of σ-fields. A sequence of random variables M_n is adapted to F_n if for each n, M_n is F_n measurable.

M_n is a martingale if M_n is adapted to F_n, M_n is integrable for all n, and

E[M_n | F_{n−1}] = M_{n−1}, a.s., n = 2, 3, . . . . (11.1)

If we have E[M_n | F_{n−1}] ≥ M_{n−1} a.s. for every n, then M_n is a submartingale. If we have E[M_n | F_{n−1}] ≤ M_{n−1}, we have a supermartingale. Submartingales have a tendency to increase.

Let us take a moment to look at some examples. If X_i is a sequence of mean zero i.i.d. random variables and S_n is the partial sum process, then M_n = S_n is a martingale, since E[M_n | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1} | F_{n−1}] = M_{n−1} + E[M_n − M_{n−1}] = M_{n−1}, using independence. If the X_i's have variance one and M_n = S_n² − n, then

E[S_n² | F_{n−1}] = E[(S_n − S_{n−1})² | F_{n−1}] + 2S_{n−1}E[S_n | F_{n−1}] − S²_{n−1} = 1 + S²_{n−1},

using independence. It follows that M_n is a martingale.

Another example is the following: if X ∈ L¹ and M_n = E[X | F_n], then M_n is a martingale.

If M_n is a martingale and H_n ∈ F_{n−1} for each n, it is easy to check that N_n = Σ_{i=1}^n H_i(M_i − M_{i−1}) is also a martingale.

If M_n is a martingale, g is convex, and g(M_n) is integrable for each n, then by Jensen's inequality

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) = g(M_n),

or g(M_n) is a submartingale. Similarly if g is convex and nondecreasing on [0, ∞) and M_n is a positive submartingale, then g(M_n) is a submartingale because

E[g(M_{n+1}) | F_n] ≥ g(E[M_{n+1} | F_n]) ≥ g(M_n).
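For the ±1 coin-flip filtration, the martingale property of these examples can be checked by brute force: E[M_n | F_{n−1}] is the average of M_n over the two equally likely continuations of the path. A small sketch by exhaustive enumeration (the horizon of 4 steps is an arbitrary choice):

```python
from itertools import product

# For +/-1 coin flips, E[M_n | F_{n-1}] is the average of M_n over the
# two equally likely continuations of the path; check it equals M_{n-1}.

def is_martingale_step(M, n, s_prev):
    """M maps (time, S_time) -> process value; s_prev = S_{n-1}."""
    avg = 0.5 * (M(n, s_prev + 1) + M(n, s_prev - 1))
    return abs(avg - M(n - 1, s_prev)) < 1e-12

def check(M, steps=4):
    # enumerate every possible value of S_{n-1} along paths of the walk
    return all(
        is_martingale_step(M, steps, sum(signs))
        for signs in product([1, -1], repeat=steps - 1)
    )

ok_sum = check(lambda n, s: s)              # M_n = S_n
ok_square = check(lambda n, s: s * s - n)   # M_n = S_n^2 - n
```

Both checks succeed because ((s+1) + (s−1))/2 = s and ((s+1)² + (s−1)²)/2 − n = s² − (n−1), exactly as in the computation above.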


12. Optional stopping.

Note that if one takes expectations in (11.1), one has E M_n = E M_{n−1}, and by induction E M_n = E M_0. The theorem about martingales that lies at the basis of all other results is Doob's optional stopping theorem, which says that the same is true if we replace n by a stopping time N. There are various versions, depending on what conditions one puts on the stopping times.

Theorem 12.1. If N is a bounded stopping time with respect to F_n and M_n a martingale, then E M_N = E M_0.

Proof. Since N is bounded, let K be the largest value N takes. We write

E M_N = Σ_{k=0}^K E[M_N; N = k] = Σ_{k=0}^K E[M_k; N = k].



Note (N = k) is F_j measurable if j ≥ k, so

E[M_k; N = k] = E[M_{k+1}; N = k] = E[M_{k+2}; N = k] = . . . = E[M_K; N = k].

Hence

E M_N = Σ_{k=0}^K E[M_K; N = k] = E M_K = E M_0.

This completes the proof.

The assumption that N be bounded cannot be entirely dispensed with. For example, let M_n be the partial sums of a sequence of i.i.d. random variables that take the values ±1, each with probability 1/2. If N = min{i : M_i = 1}, we will see later on that N < ∞ a.s., but E M_N = 1 ≠ 0 = E M_0.
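The contrast can be checked by exact enumeration: the truncated time N ∧ K is bounded, so E M_{N∧K} = 0 for every K, even though E M_N = 1. A sketch (exhaustive over all 2^K paths, using exact rational arithmetic):

```python
from itertools import product
from fractions import Fraction

# Exact E[M_{N ∧ K}] for the +/-1 walk with N = min{i : M_i = 1}:
# enumerate all 2^K equally likely paths of K flips.

def expected_stopped_value(K):
    total = Fraction(0)
    for signs in product([1, -1], repeat=K):
        m, stopped = 0, None
        for x in signs:
            m += x
            if stopped is None and m == 1:
                stopped = m              # N occurred: M_N = 1
        value = stopped if stopped is not None else m   # M_{N ∧ K}
        total += Fraction(value, 2 ** K)
    return total

e_bounded = expected_stopped_value(10)   # N ∧ 10 is bounded, so this is 0
```

As K grows, the paths on which N ≤ K contribute total mass approaching 1 at value +1, but the rare deep-negative paths exactly cancel this; in the limit that cancellation is lost, which is why E M_N = 1.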


The same proof as that in Theorem 12.1 gives the following corollary.

Corollary 12.2. If N is bounded by K and M_n is a submartingale, then E M_N ≤ E M_K.

Also the same proof gives

Corollary 12.3. If N is bounded by K, A ∈ F_N, and M_n is a submartingale, then E[M_N; A] ≤ E[M_K; A].

Proposition 12.4. If N_1 ≤ N_2 are stopping times bounded by K and M is a martingale, then E[M_{N_2} | F_{N_1}] = M_{N_1}, a.s.


Proof. Suppose A ∈ F_{N_1}. We need to show E[M_{N_1}; A] = E[M_{N_2}; A]. Define a new stopping time N_3 by

N_3(ω) = N_1(ω) if ω ∈ A, and N_3(ω) = N_2(ω) if ω ∉ A.

It is easy to check that N_3 is a stopping time, so E M_{N_3} = E M_K = E M_{N_2} implies

E[M_{N_1}; A] + E[M_{N_2}; A^c] = E[M_{N_2}].

Subtracting E[M_{N_2}; A^c] from each side completes the proof.


The following is known as the Doob decomposition.

Proposition 12.5. Suppose X_k is a submartingale with respect to an increasing sequence of σ-fields F_k. Then we can write X_k = M_k + A_k such that M_k is a martingale adapted to the F_k and A_k is a sequence of random variables with A_k being F_{k−1}-measurable and A_0 ≤ A_1 ≤ · · ·.

Proof. Let a_k = E[X_k | F_{k−1}] − X_{k−1} for k = 1, 2, . . . Since X_k is a submartingale, each a_k ≥ 0. Then let A_k = Σ_{i=1}^k a_i. The fact that the A_k are increasing and measurable with respect to F_{k−1} is clear. Set M_k = X_k − A_k. Then

E[M_{k+1} − M_k | F_k] = E[X_{k+1} − X_k | F_k] − a_{k+1} = 0,

or M_k is a martingale.
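As a concrete instance, for the ±1 walk the submartingale X_k = S_k² has a_k = E[S_k² | F_{k−1}] − S²_{k−1} = 1 regardless of the path, so A_k = k and M_k = S_k² − k, recovering the example of Section 11. A sketch computing a_k by averaging the two continuations:

```python
# Doob decomposition of the submartingale X_k = S_k^2 for the +/-1 walk:
# a_k = E[X_k | F_{k-1}] - X_{k-1}, averaged over the two continuations.

def increment(s_prev):
    """a_k as a function of S_{k-1} = s_prev."""
    e_next = 0.5 * ((s_prev + 1) ** 2 + (s_prev - 1) ** 2)
    return e_next - s_prev ** 2       # always equals 1

# a_k = 1 whatever the path, so A_k = k and M_k = S_k^2 - k.
all_one = all(abs(increment(s) - 1.0) < 1e-12 for s in range(-10, 11))
```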



Corollary 12.6. Suppose X_k is a submartingale, and N_1 ≤ N_2 are bounded stopping times. Then

E[X_{N_2} | F_{N_1}] ≥ X_{N_1}.


13. Doob's inequalities.

The first interesting consequences of the optional stopping theorems are Doob's inequalities. If M_n is a martingale, denote M*_n = max_{i≤n} |M_i|.

Theorem 13.1. If M_n is a martingale or a positive submartingale,

P(M*_n ≥ a) ≤ E[|M_n|; M*_n ≥ a]/a ≤ E|M_n|/a.

Proof. Set M_{n+1} = M_n. Let N = min{j : |M_j| ≥ a} ∧ (n + 1). Since | · | is convex, |M_n| is a submartingale. If A = (M*_n ≥ a), then A ∈ F_N and by Corollary 12.3,

aP(M*_n ≥ a) ≤ E[|M_N|; A] ≤ E[|M_n|; A] ≤ E|M_n|.


For p > 1, we have the following inequality.

Theorem 13.2. If p > 1 and E|M_i|^p < ∞ for i ≤ n, then

E(M*_n)^p ≤ (p/(p − 1))^p E|M_n|^p.

Proof. Note M*_n ≤ Σ_{i=1}^n |M_i|, hence M*_n ∈ L^p. We write

E(M*_n)^p = ∫_0^∞ pa^{p−1} P(M*_n > a) da ≤ ∫_0^∞ pa^{p−1} E[|M_n|1_{(M*_n ≥ a)}/a] da
= E ∫_0^{M*_n} pa^{p−2}|M_n| da = (p/(p − 1)) E[(M*_n)^{p−1}|M_n|]
≤ (p/(p − 1)) (E(M*_n)^p)^{(p−1)/p} (E|M_n|^p)^{1/p}.

The last inequality follows by Hölder's inequality. Now divide both sides by the quantity (E(M*_n)^p)^{(p−1)/p}.
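For p = 2 the inequality says E(M*_n)² ≤ 4 E M_n², which can be checked exactly for the ±1 walk by enumerating all paths. A sketch (n = 8 is an arbitrary choice):

```python
from itertools import product
from fractions import Fraction

# Exact check of E[(M_n^*)^2] <= (2/(2-1))^2 E[M_n^2] = 4 E[M_n^2]
# for the +/-1 walk, enumerating all 2^n equally likely paths.

def doob_l2(n):
    e_max_sq = Fraction(0)
    e_end_sq = Fraction(0)
    p = Fraction(1, 2 ** n)
    for signs in product([1, -1], repeat=n):
        m, peak = 0, 0
        for x in signs:
            m += x
            peak = max(peak, abs(m))     # M_n^* along the path
        e_max_sq += p * peak ** 2
        e_end_sq += p * m ** 2
    return e_max_sq, e_end_sq

e_max_sq, e_end_sq = doob_l2(8)           # E[M_8^2] = 8 exactly
```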


14. Martingale convergence theorems.

The martingale convergence theorems are another set of important consequences of optional stopping. The main step is the upcrossing lemma. The number of upcrossings of an interval [a, b] is the number of times a process crosses from below a to above b.

To be more exact, let

S_1 = min{k : X_k ≤ a}, T_1 = min{k > S_1 : X_k ≥ b},

and

S_{i+1} = min{k > T_i : X_k ≤ a}, T_{i+1} = min{k > S_{i+1} : X_k ≥ b}.

The number of upcrossings before time n is U_n = max{j : T_j ≤ n}.



Theorem 14.1. (Upcrossing lemma) If X_k is a submartingale,

E U_n ≤ (b − a)^{−1} E[(X_n − a)^+].

Proof. The number of upcrossings of [a, b] by X_k is the same as the number of upcrossings of [0, b − a] by Y_k = (X_k − a)^+. Moreover Y_k is still a submartingale. If we obtain the inequality for the number of upcrossings of the interval [0, b − a] by the process Y_k, we will have the desired inequality for upcrossings of X.

So we may assume a = 0. Fix n and define Y_{n+1} = Y_n. This will still be a submartingale. Define the S_i, T_i as above, and let S'_i = S_i ∧ (n + 1), T'_i = T_i ∧ (n + 1). Since T_{i+1} > S_{i+1} > T_i, then T'_{n+1} = n + 1. We write

E Y_{n+1} = E Y_{S'_1} + Σ_{i=1}^{n+1} E[Y_{T'_i} − Y_{S'_i}] + Σ_{i=1}^{n+1} E[Y_{S'_{i+1}} − Y_{T'_i}].

All the summands in the third term on the right are nonnegative since Y_k is a submartingale. For the jth upcrossing, Y_{T'_j} − Y_{S'_j} ≥ b − a, while Y_{T'_j} − Y_{S'_j} is always greater than or equal to 0. So

Σ_{i=1}^{n+1} (Y_{T'_i} − Y_{S'_i}) ≥ (b − a)U_n.

So

E U_n ≤ E Y_{n+1}/(b − a). (4.8)
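Counting upcrossings following the definition of the S_i and T_i is mechanical. A minimal sketch (the test sequence is an arbitrary choice):

```python
# Count upcrossings of [a, b]: each completed passage from a value <= a
# (time S_i) to a later value >= b (time T_i) counts once.

def upcrossings(xs, a, b):
    count, waiting_below = 0, False
    for x in xs:
        if not waiting_below and x <= a:
            waiting_below = True      # some S_i has occurred
        elif waiting_below and x >= b:
            waiting_below = False     # the matching T_i: one upcrossing
            count += 1
    return count

U = upcrossings([0, 2, -1, 3, 0, -2, 1, 4, -1], a=0, b=2)   # U = 3
```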


This leads to the martingale convergence theorem.

Theorem 14.2. If X_n is a submartingale such that sup_n E X_n^+ < ∞, then X_n converges a.s. as n → ∞.

Proof. Let U(a, b) = lim_{n→∞} U_n. For each pair of rationals a < b, by monotone convergence,

E U(a, b) ≤ (b − a)^{−1} sup_n E(X_n − a)^+ < ∞.

So U(a, b) < ∞, a.s. Taking the union over all pairs of rationals a, b, we see that a.s. the sequence X_n(ω) cannot have lim sup X_n > lim inf X_n. Therefore X_n converges a.s., although we still have to rule out the possibility of the limit being infinite. Since X_n is a submartingale, E X_n ≥ E X_0, and thus

E|X_n| = E X_n^+ + E X_n^− = 2E X_n^+ − E X_n ≤ 2E X_n^+ − E X_0.

By Fatou's lemma, E lim_n |X_n| ≤ sup_n E|X_n| < ∞, so X_n converges a.s. to a finite limit.

Corollary 14.3. If X_n is a positive supermartingale or a martingale bounded above or below, X_n converges a.s.

Proof. If X_n is a positive supermartingale, −X_n is a submartingale bounded above by 0. Now apply Theorem 14.2.



If X_n is a martingale bounded above, by considering −X_n, we may assume X_n is bounded below. Looking at X_n + M for fixed M will not affect the convergence, so we may assume X_n is bounded below by 0. Now apply the first assertion of the corollary.

Proposition 14.4. If X_n is a martingale with sup_n E|X_n|^p < ∞ for some p > 1, then the convergence is in L^p as well as a.s. This is also true when X_n is a submartingale. If X_n is a uniformly integrable martingale, then the convergence is in L¹. If X_n → X_∞ in L¹, then X_n = E[X_∞ | F_n].

X_n is a uniformly integrable martingale if the collection of random variables X_n is uniformly integrable.

Proof. The L^p convergence assertion follows by using Doob's inequality (Theorem 13.2) and dominated convergence. The L¹ convergence assertion follows since a.s. convergence together with uniform integrability implies L¹ convergence. Finally, if j < n, we have X_j = E[X_n | F_j]. If A ∈ F_j,

E[X_j; A] = E[X_n; A] → E[X_∞; A]

by the L¹ convergence of X_n to X_∞. Since this is true for all A ∈ F_j, X_j = E[X_∞ | F_j].



15. Applications of martingales.

Let S_n be your fortune at time n. In a fair casino, E[S_{n+1} | F_n] = S_n. If N is a stopping time, the optional stopping theorem says that E S_N = E S_0; in other words, no matter what stopping time you use and what method of betting, you will do no better on average than ending up with what you started with.

An elegant application of martingales is a proof of the SLLN. Fix N large. Let Y_i be i.i.d. Let Z_n = E[Y_1 | S_n, S_{n+1}, . . . , S_N]. We claim Z_n = S_n/n. Certainly S_n/n is σ(S_n, . . . , S_N) measurable. If A ∈ σ(S_n, . . . , S_N) for some n, then A = ((S_n, . . . , S_N) ∈ B) for some Borel subset B of R^{N−n+1}. Since the Y_i are i.i.d., for each k ≤ n,

E[Y_1; (S_n, . . . , S_N) ∈ B] = E[Y_k; (S_n, . . . , S_N) ∈ B].

Summing over k and dividing by n,

E[Y_1; (S_n, . . . , S_N) ∈ B] = E[S_n/n; (S_n, . . . , S_N) ∈ B].

Therefore E[Y_1; A] = E[S_n/n; A] for every A ∈ σ(S_n, . . . , S_N). Thus Z_n = S_n/n.

Let X_k = Z_{N−k}, and let F_k = σ(S_{N−k}, S_{N−k+1}, . . . , S_N). Note F_k gets larger as k gets larger, and by the above X_k = E[Y_1 | F_k]. This shows that X_k is a martingale (cf. the next to last example in Section 11). By Doob's upcrossing inequality, if U_N^X is the number of upcrossings of [a, b] by X, then

E U_N^X ≤ E X_N^+/(b − a) ≤ E|Z_0|/(b − a) ≤ E|Y_1|/(b − a).

This differs by at most one from the number of upcrossings of [a, b] by Z_1, . . . , Z_N. So the expected number of upcrossings of [a, b] by Z_k for k ≤ N is bounded by 1 + E|Y_1|/(b − a). Now let N → ∞. By Fatou's lemma, the expected number of upcrossings of [a, b] by Z_1, . . . is finite. Arguing as in the proof of the martingale convergence theorem, this says that Z_n = S_n/n does not oscillate.

It is conceivable that |S_n/n| → ∞. But by Fatou's lemma,

E[lim |S_n/n|] ≤ lim inf E|S_n/n| ≤ lim inf (nE|Y_1|)/n = E|Y_1| < ∞.
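The conclusion that S_n/n settles down near E Y_1 is easy to illustrate numerically. A Monte Carlo sketch (the distribution, sample size, and seed are arbitrary choices):

```python
import random

# Monte Carlo illustration that S_n / n settles near E[Y_1].
random.seed(0)

n = 200_000
s = 0.0
for _ in range(n):
    s += random.uniform(-1.0, 3.0)   # E[Y_1] = 1, E|Y_1| finite

sample_mean = s / n                  # S_n / n
```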



Proposition 15.1. Suppose the Y_i are i.i.d. with E|Y_1| < ∞, N is a stopping time with E N < ∞, and N is independent of the Y_i. Then E S_N = (E N)(E Y_1), where the S_n are the partial sums of the Y_i.

Proof. S_n − n(E Y_1) is a martingale, so E S_{n∧N} = E(n ∧ N) E Y_1 by optional stopping. The right hand side tends to (E N)(E Y_1) by monotone convergence. S_{n∧N} converges almost surely to S_N, and we need to show the expected values converge. Note

|S_{n∧N}| = Σ_{k=0}^∞ |S_{n∧k}|1_{(N=k)} ≤ Σ_{k=0}^∞ Σ_{j=1}^{n∧k} |Y_j|1_{(N=k)} = Σ_{j=1}^n Σ_{k≥j} |Y_j|1_{(N=k)} = Σ_{j=1}^n |Y_j|1_{(N≥j)} ≤ Σ_{j=1}^∞ |Y_j|1_{(N≥j)}.

The last expression, using the independence, has expected value

Σ_{j=1}^∞ (E|Y_j|) P(N ≥ j) ≤ (E|Y_1|)(1 + E N) < ∞.

So by dominated convergence, we have E S_{n∧N} → E S_N.

Wald's second identity is a similar expression for the variance of S_N.
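Wald's identity lends itself to a direct Monte Carlo check when N is drawn independently of the Y_i. A sketch (the particular distributions and seed are arbitrary choices):

```python
import random

# Monte Carlo check of E[S_N] = (E N)(E Y_1) when N is independent
# of the i.i.d. Y_i.
random.seed(1)

def one_trial():
    n = 1 + random.randrange(10)     # N uniform on {1,...,10}: E N = 5.5
    return sum(random.choice([1, 2, 3]) for _ in range(n))   # E Y_1 = 2

trials = 40_000
estimate = sum(one_trial() for _ in range(trials)) / trials
target = 5.5 * 2.0                   # (E N)(E Y_1) = 11
```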



We can use martingales to find certain hitting probabilities.

Proposition 15.2. Suppose the Y_i are i.i.d. with P(Y_1 = 1) = 1/2, P(Y_1 = −1) = 1/2, and S_n the partial sum process. Suppose a and b are positive integers. Then

P(S_n hits −a before b) = b/(a + b).

If N = min{n : S_n ∈ {−a, b}}, then E N = ab.

Proof. S_n² − n is a martingale, so E S²_{n∧N} = E(n ∧ N). Let n → ∞. The right hand side converges to E N by monotone convergence. Since S_{n∧N} is bounded in absolute value by a + b, the left hand side converges by dominated convergence to E S_N², which is finite. So E N is finite, hence N is finite almost surely.

S_n is a martingale, so E S_{n∧N} = E S_0 = 0. By dominated convergence, and the fact that N < ∞ a.s., hence S_{n∧N} → S_N, we have E S_N = 0, or

−aP(S_N = −a) + bP(S_N = b) = 0.

We also have

P(S_N = −a) + P(S_N = b) = 1.

Solving these two equations for P(S_N = −a) and P(S_N = b) yields our first result. Since E N = E S_N² = a²P(S_N = −a) + b²P(S_N = b), substituting gives the second result.

Based on this proposition, if we let a → ∞, we see that P(N_b < ∞) = 1 and E N_b = ∞, where N_b = min{n : S_n = b}.
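The two answers can be cross-checked by first-step analysis (not used in the text): conditioning on the first step gives p(x) = (p(x−1) + p(x+1))/2 for the hitting probability and m(x) = 1 + (m(x−1) + m(x+1))/2 for the mean exit time. A sketch solving these by value iteration, with a = 2, b = 3:

```python
# First-step analysis for the +/-1 walk exiting (-a, b):
#   p(x) = P(hit -a before b | S_0 = x):  p(-a) = 1, p(b) = 0
#   m(x) = E[exit time | S_0 = x]:        m(-a) = m(b) = 0
a, b = 2, 3
p = {x: 0.0 for x in range(-a, b + 1)}
m = {x: 0.0 for x in range(-a, b + 1)}
p[-a] = 1.0                          # boundary conditions

for _ in range(20_000):              # value iteration to convergence
    for x in range(-a + 1, b):
        p[x] = 0.5 * (p[x - 1] + p[x + 1])
        m[x] = 1.0 + 0.5 * (m[x - 1] + m[x + 1])

hit_prob, mean_exit = p[0], m[0]     # should be b/(a+b) and a*b
```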



Proposition 15.3. Suppose A_n ∈ F_n. Then (A_n i.o.) and (Σ_{n=1}^∞ P(A_n | F_{n−1}) = ∞) differ by a null set.

Proof. Let X_n = Σ_{m=1}^n [1_{A_m} − P(A_m | F_{m−1})]. Note |X_n − X_{n−1}| ≤ 1. Also, it is easy to see that E[X_n − X_{n−1} | F_{n−1}] = 0, so X_n is a martingale.

We claim that for almost every ω either lim X_n exists and is finite, or else lim sup X_n = ∞ and lim inf X_n = −∞. In fact, if N = min{n : X_n ≤ −k}, then X_{n∧N} ≥ −k − 1, so X_{n∧N} converges by the martingale convergence theorem. Therefore lim X_n exists and is finite on (N = ∞). So if lim X_n does not exist or is not finite, then N < ∞. This is true for all k, hence lim inf X_n = −∞. A similar argument shows lim sup X_n = ∞ in this case.

Now if lim X_n exists and is finite, then Σ_{n=1}^∞ 1_{A_n} = ∞ if and only if Σ P(A_n | F_{n−1}) = ∞. On the other hand, if the limit does not exist or is not finite, then Σ 1_{A_n} = ∞ and Σ P(A_n | F_{n−1}) = ∞.


16. Weak convergence.

We will see later that if the X_i are i.i.d. with mean zero and variance one, then S_n/√n converges in the sense

P(S_n/√n ∈ [a, b]) → P(Z ∈ [a, b]),

where Z is a standard normal. If S_n/√n converged in probability or almost surely, then by the zero-one law it would converge to a constant, contradicting the above. We want to generalize the above type of convergence.

We say F_n converges weakly to F if F_n(x) → F(x) for all x at which F is continuous. Here F_n and F are distribution functions. We say X_n converges weakly to X if F_{X_n} converges weakly to F_X. We sometimes say X_n converges in distribution or converges in law to X. Probabilities μ_n converge weakly if their corresponding distribution functions converge, that is, if F_{μ_n}(x) = μ_n(−∞, x] converges weakly.

An example that illustrates why we restrict the convergence to continuity points of F is the following. Let X_n = 1/n with probability one, and X = 0 with probability one. F_{X_n}(x) is 0 if x < 1/n and 1 otherwise. F_{X_n}(x) converges to F_X(x) for all x except x = 0.


Proposition 16.1. X_n converges weakly to X if and only if E g(X_n) → E g(X) for all g bounded and continuous.

The idea that E g(X_n) converges to E g(X) for all g bounded and continuous makes sense for any metric space and is used as a definition of weak convergence for X_n taking values in general metric spaces.

Proof. First suppose E g(X_n) converges to E g(X). Let x be a continuity point of F, let ε > 0, and choose δ such that |F(y) − F(x)| < ε if |y − x| < δ. Choose g continuous such that g is one on (−∞, x], takes values between 0 and 1, and is 0 on [x + δ, ∞). Then F_{X_n}(x) ≤ E g(X_n) → E g(X) ≤ F_X(x + δ) ≤ F(x) + ε.

Similarly, if h is a continuous function taking values between 0 and 1 that is 1 on (−∞, x − δ] and 0 on [x, ∞), F_{X_n}(x) ≥ E h(X_n) → E h(X) ≥ F_X(x − δ) ≥ F(x) − ε. Since ε is arbitrary, F_{X_n}(x) → F_X(x).

Now suppose X_n converges weakly to X. If a and b are continuity points of F and of all the F_{X_n}, then E 1_{[a,b]}(X_n) = F_{X_n}(b) − F_{X_n}(a) → F(b) − F(a) = E 1_{[a,b]}(X). By taking linear combinations, we have E g(X_n) → E g(X) for every g which is a step function where the end points of the intervals are continuity points for all the F_{X_n} and for F_X. Since the set of points that are not a continuity point for some F_{X_n} or for F_X is countable, and we can approximate any continuous function on an interval by such step functions uniformly, we have E g(X_n) → E g(X) for all g such that the support of g is a closed interval whose endpoints are continuity points of F_X and g is continuous on its support.

Let ε > 0 and choose M such that F_X(M) > 1 − ε and F_X(−M) < ε and so that M and −M are continuity points of F_X. By the preceding, E(1_{[−M,M]}g)(X_n) → E(1_{[−M,M]}g)(X),



where g is a bounded continuous function. The difference between E(1_{[−M,M]}g)(X) and E g(X) is bounded by ‖g‖_∞ P(X ∉ [−M, M]) ≤ 2ε‖g‖_∞. Similarly, when X is replaced by X_n, the difference is bounded by ‖g‖_∞ P(X_n ∉ [−M, M]) → ‖g‖_∞ P(X ∉ [−M, M]). So for n large, it is less than 3ε‖g‖_∞. Since ε is arbitrary, E g(X_n) → E g(X) whenever g is bounded and continuous.


Let us examine the relationship between weak convergence and convergence in probability. The example of S_n/√n shows that one can have weak convergence without convergence in probability.

Proposition 16.2. (a) If X_n converges to X in probability, then it converges weakly.
(b) If X_n converges weakly to a constant, it converges in probability.
(c) (Slutsky's theorem) If X_n converges weakly to X and Y_n converges weakly to a constant c, then X_n + Y_n converges weakly to X + c and X_nY_n converges weakly to cX.

Proof. To prove (a), let g be a bounded and continuous function. If n_j is any subsequence, then there exists a further subsequence such that X(n_{j_k}) converges almost surely to X. Then by dominated convergence, E g(X(n_{j_k})) → E g(X). That suffices to show E g(X_n) converges to E g(X).

For (b), if X_n converges weakly to c,

P(X_n − c > ε) = P(X_n > c + ε) = 1 − P(X_n ≤ c + ε) → 1 − P(c ≤ c + ε) = 0.

We use the fact that if Y ≡ c, then c + ε is a point of continuity for F_Y. A similar equation shows P(X_n − c ≤ −ε) → 0, so P(|X_n − c| > ε) → 0.

We now prove the first part of (c), leaving the second part for the reader. Let x be a point such that x − c is a continuity point of F_X. Choose ε so that x − c + ε is again a continuity point. Then

P(X_n + Y_n ≤ x) ≤ P(X_n + c ≤ x + ε) + P(|Y_n − c| > ε) → P(X ≤ x − c + ε).

So lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x + ε). Since ε can be as small as we like and x − c is a continuity point of F_X, then lim sup P(X_n + Y_n ≤ x) ≤ P(X + c ≤ x). The lim inf is done similarly.

We say a sequence of distribution functions {F_n} is tight if for each ε > 0 there exists M such that F_n(M) ≥ 1 − ε and F_n(−M) ≤ ε for all n. A sequence of r.v.s is tight if the corresponding distribution functions are tight; this is equivalent to P(|X_n| ≥ M) ≤ ε for all n.

Theorem 16.3. (Helly's theorem) Let F_n be a sequence of distribution functions that is tight. There exists a subsequence n_j and a distribution function F such that F_{n_j} converges weakly to F.

What could happen without tightness is that X_n = n, so that F_{X_n} → 0; the tightness precludes this.


Proof. Let q_k be an enumeration of the rationals. Since F_n(q_k) ∈ [0, 1], any subsequence has a further subsequence that converges. Use diagonalization so that F_{n_j}(q_k) converges for each q_k and call the limit F(q_k). F is nondecreasing, and define F(x) = inf_{q_k > x} F(q_k). So F is right continuous and nondecreasing.

If x is a point of continuity of F and ε > 0, then there exist r and s rational such that r < x < s and F(s) − ε < F(x) < F(r) + ε. Then

F_{n_j}(x) ≥ F_{n_j}(r) → F(r) > F(x) − ε

and

F_{n_j}(x) ≤ F_{n_j}(s) → F(s) < F(x) + ε.

Since ε is arbitrary, F_{n_j}(x) → F(x).

Since the F_n are tight, there exists M such that F_n(−M) < ε for all n. Then F(−M) ≤ ε, which implies F(x) → 0 as x → −∞; similarly F(x) → 1 as x → ∞, so F is a distribution function.



Proposition 16.4. Suppose there exists ϕ : [0, ∞) → [0, ∞) that is increasing and ϕ(x) → ∞ as x → ∞. If c = sup_n E ϕ(|X_n|) < ∞, then the X_n are tight.

Proof. Let ε > 0. Choose M such that ϕ(x) ≥ c/ε if x > M. Then

P(|X_n| > M) ≤ ∫ (ϕ(|X_n|)/(c/ε)) 1_{(|X_n|>M)} dP ≤ (ε/c) E ϕ(|X_n|) ≤ ε.


17. Characteristic functions.

We define the characteristic function of a random variable X by ϕ_X(t) = E e^{itX} for t ∈ R.

Note that ϕ_X(t) = ∫ e^{itx} P_X(dx). So if X and Y have the same law, they have the same characteristic function. Also, if the law of X has a density, that is, P_X(dx) = f_X(x) dx, then ϕ_X(t) = ∫ e^{itx} f_X(x) dx, so in this case the characteristic function is the same as (one definition of) the Fourier transform of f_X.

Proposition 17.1. ϕ(0) = 1, |ϕ(t)| ≤ 1, ϕ(−t) = \overline{ϕ(t)}, and ϕ is uniformly continuous.

Proof. Since |e^{itx}| ≤ 1, everything follows immediately from the definitions except the uniform continuity. For that we write

|ϕ(t + h) − ϕ(t)| = |E e^{i(t+h)X} − E e^{itX}| ≤ E|e^{itX}(e^{ihX} − 1)| = E|e^{ihX} − 1|.

|e^{ihX} − 1| tends to 0 almost surely as h → 0, so the right hand side tends to 0 by dominated convergence. Note that the right hand side is independent of t.

Proposition 17.2. ϕ_{aX}(t) = ϕ_X(at) and ϕ_{X+b}(t) = e^{itb}ϕ_X(t).

Proof. The first follows from E e^{it(aX)} = E e^{i(at)X}, and the second is similar.

Proposition 17.3. If X and Y are independent, then ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t).

Proof. From the multiplication theorem,

E e^{it(X+Y)} = E e^{itX}e^{itY} = E e^{itX} E e^{itY}.

Note that if X_1 and X_2 are independent and identically distributed, then

ϕ_{X_1−X_2}(t) = ϕ_{X_1}(t)ϕ_{−X_2}(t) = ϕ_{X_1}(t)ϕ_{X_2}(−t) = ϕ_{X_1}(t)\overline{ϕ_{X_2}(t)} = |ϕ_{X_1}(t)|².


Let us look at some examples of characteristic functions.

(a) Bernoulli: By direct computation, this is pe^{it} + (1 − p) = 1 − p(1 − e^{it}).

(b) Coin flip: (i.e., P(X = +1) = P(X = −1) = 1/2) We have (1/2)e^{it} + (1/2)e^{−it} = cos t.

(c) Poisson:

E e^{itX} = Σ_{k=0}^∞ e^{itk} e^{−λ} λ^k/k! = e^{−λ} Σ (λe^{it})^k/k! = e^{−λ}e^{λe^{it}} = e^{λ(e^{it}−1)}.

(e) Binomial: Write X as the sum of n independent Bernoulli r.v.s B_i. So

ϕ_X(t) = Π_{i=1}^n ϕ_{B_i}(t) = [ϕ_{B_i}(t)]^n = [1 − p(1 − e^{it})]^n.

(f) Geometric:

ϕ(t) = Σ_{k=0}^∞ p(1 − p)^k e^{itk} = p Σ ((1 − p)e^{it})^k = p/(1 − (1 − p)e^{it}).

(g) Uniform on [a, b]:

ϕ(t) = (1/(b − a)) ∫_a^b e^{itx} dx = (e^{itb} − e^{ita})/((b − a)it).

Note that when a = −b this reduces to sin(bt)/bt.

(h) Exponential:

∫_0^∞ λe^{itx}e^{−λx} dx = λ ∫_0^∞ e^{(it−λ)x} dx = λ/(λ − it).

(i) Standard normal:

ϕ(t) = (1/√(2π)) ∫_{−∞}^∞ e^{itx}e^{−x²/2} dx.

This can be done by completing the square and then doing a contour integration. Alternately, ϕ'(t) = (1/√(2π)) ∫_{−∞}^∞ ixe^{itx}e^{−x²/2} dx (do the real and imaginary parts separately, and use the dominated convergence theorem to justify taking the derivative inside). Integrating by parts (again doing the real and imaginary parts separately), this is equal to −tϕ(t). The only solution to ϕ'(t) = −tϕ(t) with ϕ(0) = 1 is ϕ(t) = e^{−t²/2}.

(j) Normal with mean μ and variance σ²: Writing X = σZ + μ, where Z is a standard normal, then

ϕ_X(t) = e^{iμt}ϕ_Z(σt) = e^{iμt−σ²t²/2}.

(k) Cauchy: We have

ϕ(t) = (1/π) ∫ e^{itx}/(1 + x²) dx.

This is a standard exercise in contour integration in complex analysis. The answer is e^{−|t|}.
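The Poisson computation can be sanity-checked numerically by summing the defining series. A sketch (λ = 2 and t = 1.3 are arbitrary choices):

```python
import cmath
import math

# Sanity-check the Poisson characteristic function
#   E e^{itX} = exp(lam (e^{it} - 1))
# by summing the defining series numerically.
lam, t = 2.0, 1.3

series = sum(
    cmath.exp(1j * t * k) * math.exp(-lam) * lam ** k / math.factorial(k)
    for k in range(80)               # tail beyond k = 80 is negligible
)
closed_form = cmath.exp(lam * (cmath.exp(1j * t) - 1))
err = abs(series - closed_form)
```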


18. Inversion formula.

We need a preliminary real variable lemma, and then we can proceed to the inversion formula, which gives a formula for the distribution function in terms of the characteristic function.

Lemma 18.1. (a) ∫_0^N (sin(Ax)/x) dx → sgn(A) π/2 as N → ∞.
(b) sup_a |∫_0^a (sin(Ax)/x) dx| < ∞.



An alternate proof of (a) is the following. e^{−xy} sin x is integrable on {(x, y) : 0 < x < a, 0 < y < ∞}. So

∫_0^a (sin x)/x dx = ∫_0^a ∫_0^∞ e^{−xy} sin x dy dx
= ∫_0^∞ ∫_0^a e^{−xy} sin x dx dy
= ∫_0^∞ [ (e^{−xy}/(y² + 1)) (−y sin x − cos x) ]_{x=0}^{x=a} dy
= ∫_0^∞ [ (e^{−ay}/(y² + 1)) (−y sin a − cos a) − (−1/(y² + 1)) ] dy
= π/2 − sin a ∫_0^∞ (ye^{−ay}/(y² + 1)) dy − cos a ∫_0^∞ (e^{−ay}/(y² + 1)) dy.

The last two integrals tend to 0 as a → ∞ since the integrands are bounded by (1 + y)e^{−y} if a ≥ 1.
Theorem 18.2. (Inversion formula) Let μ be a probability measure and let ϕ(t) = ∫ e^{itx} μ(dx). If a < b, then

lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = μ(a, b) + (1/2)μ({a}) + (1/2)μ({b}).

The example where μ is point mass at 0, so ϕ(t) = 1, shows that one needs to take a limit, since the integrand in this case is 2 sin t/t, which is not integrable.


Proof. By Fubini,

∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt = ∫_{−T}^T ∫ ((e^{−ita} − e^{−itb})/(it)) e^{itx} μ(dx) dt
= ∫ ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) e^{itx} dt μ(dx).

To justify this, we bound the integrand using the mean value theorem: |(e^{−ita} − e^{−itb})/(it)| ≤ b − a, so the double integral converges absolutely.

Expanding e^{−itb} and e^{−ita} using Euler's formula, and using the fact that cos is an even function and sin is odd, we are left with

∫ 2 [ ∫_0^T (sin(t(x − a))/t) dt − ∫_0^T (sin(t(x − b))/t) dt ] μ(dx).

Using Lemma 18.1 and dominated convergence, this tends to

∫ [π sgn(x − a) − π sgn(x − b)] μ(dx).

Dividing by 2π and noting that (sgn(x − a) − sgn(x − b))/2 equals 1 on (a, b), 1/2 at a and at b, and 0 outside [a, b] completes the proof.


Theorem 18.3. If ∫ |ϕ(t)| dt < ∞, then μ has a bounded density f and

f(y) = (1/2π) ∫ e^{−ity} ϕ(t) dt.

Proof.

μ(a, b) + (1/2)μ({a}) + (1/2)μ({b}) = lim_{T→∞} (1/2π) ∫_{−T}^T ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
= (1/2π) ∫_{−∞}^∞ ((e^{−ita} − e^{−itb})/(it)) ϕ(t) dt
≤ ((b − a)/2π) ∫ |ϕ(t)| dt.

Letting b → a shows that μ has no point masses. We now write

μ(x, x + h) = (1/2π) ∫ ((e^{−itx} − e^{−it(x+h)})/(it)) ϕ(t) dt
= (1/2π) ∫ (∫_x^{x+h} e^{−ity} dy) ϕ(t) dt
= ∫_x^{x+h} [ (1/2π) ∫ e^{−ity} ϕ(t) dt ] dy.

So μ has density (1/2π) ∫ e^{−ity} ϕ(t) dt. As in the proof of Proposition 17.1, we see f is continuous.
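Applied to the standard normal, where ϕ(t) = e^{−t²/2} is integrable, the formula recovers the N(0,1) density. A numerical sketch using a plain midpoint Riemann sum (the truncation T and grid size are arbitrary choices):

```python
import math

# Invert phi(t) = exp(-t^2/2) numerically via
#   f(y) = (1/(2 pi)) \int e^{-ity} phi(t) dt,
# which should recover the N(0,1) density. The imaginary part
# integrates to zero by symmetry, leaving a cosine integral.
y = 0.7
T, n = 30.0, 150_000                 # truncation and grid (arbitrary)
dt = 2 * T / n

f_y = sum(
    math.exp(-t * t / 2) * math.cos(t * y)
    for t in (-T + (k + 0.5) * dt for k in range(n))
) * dt / (2 * math.pi)

exact = math.exp(-y * y / 2) / math.sqrt(2 * math.pi)
```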


A corollary to the inversion formula is the uniqueness theorem.

Theorem 18.4. If ϕ_X = ϕ_Y, then P_X = P_Y.

The following proposition can be proved directly, but the proof using characteristic functions is much easier.

Proposition 18.5. (a) If X and Y are independent, X is a normal with mean a and variance b², and Y is a normal with mean c and variance d², then X + Y is normal with mean a + c and variance b² + d².
(b) If X and Y are independent, X is Poisson with parameter λ_1, and Y is Poisson with parameter λ_2, then X + Y is Poisson with parameter λ_1 + λ_2.
(c) If X_i are i.i.d. Cauchy, then S_n/n is Cauchy.

Proof. For (a),

ϕ_{X+Y}(t) = ϕ_X(t)ϕ_Y(t) = e^{iat−b²t²/2} e^{ict−d²t²/2} = e^{i(a+c)t−(b²+d²)t²/2}.

Now use the uniqueness theorem.

Parts (b) and (c) are proved similarly.


19. Continuity theorem.

Lemma 19.1. Suppose ϕ is the characteristic function of a probability μ. Then

μ([−2A, 2A]) ≥ A |∫_{−1/A}^{1/A} ϕ(t) dt| − 1.


Proof. Note

(1/2T) ∫_{−T}^T ϕ(t) dt = (1/2T) ∫_{−T}^T ∫ e^{itx} μ(dx) dt = ∫∫ (1/2T) 1_{[−T,T]}(t) e^{itx} dt μ(dx) = ∫ (sin Tx)/(Tx) μ(dx).

Since |(sin Tx)/(Tx)| ≤ 1 and |sin Tx| ≤ 1, then |(sin Tx)/(Tx)| ≤ 1/(2TA) if |x| ≥ 2A. So

|∫ (sin Tx)/(Tx) μ(dx)| ≤ μ([−2A, 2A]) + ∫_{[−2A,2A]^c} (1/(2TA)) μ(dx)
= μ([−2A, 2A]) + (1/(2TA))(1 − μ([−2A, 2A]))
= 1/(2TA) + (1 − 1/(2TA)) μ([−2A, 2A]).

Setting T = 1/A,

|(A/2) ∫_{−1/A}^{1/A} ϕ(t) dt| ≤ 1/2 + (1/2) μ([−2A, 2A]).

Now multiply both sides by 2.


Proposition 19.2. If μ_n converges weakly to μ, then ϕ_n converges to ϕ uniformly on every finite interval.

Proof. Let ε > 0 and choose M large so that μ([−M, M]^c) < ε. Define f to be 1 on [−M, M], 0 on [−M − 1, M + 1]^c, and linear in between. Since ∫ f dμ_n → ∫ f dμ, then if n is large enough,

∫ (1 − f) dμ_n ≤ 2ε.

We have

|ϕ_n(t + h) − ϕ_n(t)| ≤ ∫ |e^{ihx} − 1| μ_n(dx) ≤ 2 ∫ (1 − f) dμ_n + |h| ∫ |x| f(x) μ_n(dx) ≤ 4ε + |h|(M + 1).

So for n large enough and |h| ≤ ε/(M + 1), we have

|ϕ_n(t + h) − ϕ_n(t)| ≤ 5ε,

which says that the ϕ_n are equicontinuous. Since ϕ_n(t) → ϕ(t) for each t (apply Proposition 16.1 to the real and imaginary parts of e^{itx}, which are bounded and continuous), equicontinuity makes the convergence uniform on finite intervals.


The interesting result of this section is the converse, Lévy's continuity theorem.

Theorem 19.3. Suppose μ_n are probabilities, ϕ_n(t) converges to a function ϕ(t) for each t, and ϕ is continuous at 0. Then ϕ is the characteristic function of a probability μ and μ_n converges weakly to μ.

Proof. Let ε > 0. Since ϕ is continuous at 0 (and ϕ(0) = lim ϕ_n(0) = 1), choose δ small so that

|(1/2δ) ∫_{−δ}^δ ϕ(t) dt − 1| < ε.

Using the dominated convergence theorem, choose N such that

(1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt < ε

if n ≥ N. So if n ≥ N,

|(1/2δ) ∫_{−δ}^δ ϕ_n(t) dt| ≥ |(1/2δ) ∫_{−δ}^δ ϕ(t) dt| − (1/2δ) ∫_{−δ}^δ |ϕ_n(t) − ϕ(t)| dt ≥ 1 − 2ε.

By Lemma 19.1 with A = 1/δ, for such n,

μ_n[−2/δ, 2/δ] ≥ 2(1 − 2ε) − 1 = 1 − 4ε.

This shows the μ_n are tight.

Let n_j be a subsequence such that μ_{n_j} converges weakly, say to μ. Then ϕ_{n_j}(t) → ϕ_μ(t), hence ϕ(t) = ϕ_μ(t), or ϕ is the characteristic function of a probability μ. If μ′ is any subsequential weak limit point of μ_n, then ϕ_{μ′}(t) = ϕ(t) = ϕ_μ(t); so μ′ must equal μ. Hence μ_n converges weakly to μ.



Proposition 19.4. If E|X|^k < ∞ for an integer k, then ϕ_X has a continuous derivative of order k and

ϕ^{(k)}(t) = ∫ (ix)^k e^{itx} P_X(dx).

In particular, ϕ^{(k)}(0) = i^k E X^k.

Proof. Write

(ϕ(t + h) − ϕ(t))/h = ∫ ((e^{i(t+h)x} − e^{itx})/h) P_X(dx).

The modulus of the integrand is bounded by |x|. So if ∫ |x| P_X(dx) < ∞, we can use dominated convergence to obtain the desired formula for ϕ′(t). As in the proof of Proposition 17.1, we see ϕ′(t) is continuous. We do the case of general k by induction. Evaluating ϕ^{(k)} at 0 gives the particular case.


Here is a converse.

Proposition 19.5. If ϕ is the characteristic function of a random variable X and ϕ′′(0) exists, then E|X|² < ∞.

Proof. Note

(e^{ihx} − 2 + e^{−ihx})/h² = −2(1 − cos hx)/h² ≤ 0

and 2(1 − cos hx)/h² converges to x² as h → 0. So by Fatou's lemma,

∫ x² P_X(dx) ≤ 2 lim inf_{h→0} ∫ ((1 − cos hx)/h²) P_X(dx) = −lim sup_{h→0} (ϕ(h) − 2ϕ(0) + ϕ(−h))/h² = −ϕ′′(0) < ∞.


One nice application of the continuity theorem is a proof of the weak law of large numbers. Its proof is very similar to the proof of the central limit theorem, which we give in the next section.

Another nice use of characteristic functions and martingales is the following.

Proposition 19.6. Suppose X_i is a sequence of independent r.v.s and S_n converges weakly. Then S_n converges almost surely.


Proof. Suppose Sn converges weakly to W. Then ϕSn(t) → ϕW(t) uniformly on compact sets by Proposition
19.2. Since ϕW(0) = 1 and ϕW is continuous, there exists δ such that |ϕW(t) − 1| < 1/2 if |t| < δ. So for n
large, |ϕSn(t)| ≥ 1/4 if |t| < δ.



Note

E[e^{itSn} | X1, . . . , Xn−1] = e^{itS_{n−1}} E[e^{itXn} | X1, . . . , Xn−1] = e^{itS_{n−1}} ϕXn(t).

Since ϕSn(t) = Π ϕXi(t), it follows that e^{itSn}/ϕSn(t) is a martingale.

Therefore for |t| < δ and n large, e^{itSn}/ϕSn(t) is a bounded martingale, and hence converges almost
surely. Since ϕSn(t) → ϕW(t) ≠ 0, e^{itSn} converges almost surely if |t| < δ.


Let A = {(ω, t) ∈ Ω × (−δ, δ) : e^{itSn(ω)} does not converge}. For each t, we have almost sure
convergence, so ∫ 1A(ω, t) P(dω) = 0. Therefore ∫_{−δ}^{δ} ∫ 1A dP dt = 0, and by Fubini, ∫ ∫_{−δ}^{δ} 1A dt dP = 0. Hence,
almost surely, ∫ 1A(ω, t) dt = 0. This means there exists a set N with P(N) = 0 such that if ω ∉ N, then
e^{itSn(ω)} converges for almost every t ∈ (−δ, δ).



If ω ∉ N, by dominated convergence, ∫_0^a e^{itSn(ω)} dt converges as n → ∞, provided a < δ. Call the limit Aa. Also

∫_0^a e^{itSn(ω)} dt = (e^{iaSn(ω)} − 1)/(iSn(ω))

if Sn(ω) ≠ 0, and equals a otherwise.

Since Sn converges weakly, it is not possible for |Sn| → ∞ with positive probability. If we let
N′ = {ω : |Sn(ω)| → ∞} and choose ω ∉ N ∪ N′, there exists a subsequence Snj(ω) which converges
to a finite limit, say R. We can choose a < δ such that e^{iaSn(ω)} converges and e^{iaR} ≠ 1. Therefore
Aa = (e^{iaR} − 1)/(iR), a nonzero quantity. But then

Sn(ω) = (e^{iaSn(ω)} − 1) / ( i ∫_0^a e^{itSn(ω)} dt ) → ( lim_{n→∞} e^{iaSn(ω)} − 1 )/(iAa).

Therefore, except for ω ∈ N ∪ N′, we have that Sn(ω) converges.


20. Central limit theorem.


The simplest case of the central limit theorem (CLT) is the case when the Xi are i.i.d., with mean
zero and variance one, and then the CLT says that Sn/√n converges weakly to a standard normal. We first
prove this case.


We need the fact that if cn are complex numbers converging to c, then (1 + (cn/n))^n → e^c. We leave
the proof of this to the reader, with the warning that any proof using logarithms needs to be done with some
care, since log z is a multi-valued function when z is complex.


Theorem 20.1. Suppose the Xi are i.i.d., mean zero, and variance one. Then Sn/√n converges weakly to
a standard normal.


Proof. Since X1 has finite second moment, ϕX1 has a continuous second derivative. By Taylor’s
theorem,

ϕX1(t) = ϕX1(0) + ϕ′X1(0) t + ϕ″X1(0) t^2/2 + R(t),

where |R(t)|/t^2 → 0 as |t| → 0. So

ϕX1(t) = 1 − t^2/2 + R(t).

Then

ϕ_{Sn/√n}(t) = ϕSn(t/√n) = (ϕX1(t/√n))^n = [1 − t^2/(2n) + R(t/√n)]^n.

Since t/√n converges to zero as n → ∞, we have nR(t/√n) = t^2 [R(t/√n)/(t/√n)^2] → 0, and so by the
fact above,

ϕ_{Sn/√n}(t) → e^{−t^2/2}.

Now apply the continuity theorem.
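The theorem can also be seen by simulation (an added illustration, not part of the notes); here the Xi are taken uniform on (−√3, √3), an arbitrary mean-zero, variance-one choice, and the empirical distribution of Sn/√n is compared with the standard normal CDF.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)
n, trials = 200, 2000
b = sqrt(3.0)  # uniform(-sqrt(3), sqrt(3)) has mean 0 and variance 1
vals = [sum(random.uniform(-b, b) for _ in range(n)) / sqrt(n)
        for _ in range(trials)]

# Compare the empirical CDF of S_n/sqrt(n) with the standard normal CDF.
Z = NormalDist()
for a in (-1.0, 0.0, 1.0):
    emp = sum(v <= a for v in vals) / trials
    print(a, round(emp, 3), round(Z.cdf(a), 3))
```

The empirical and normal CDF values agree to within Monte Carlo error (about 0.01 here).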



Proposition 20.2. With Xi as above, Sn/√n converges weakly to a standard normal.


Proof. Let Y1, . . . , Yn be i.i.d. standard normal r.v.s that are independent of the Xi. Let Z1 = Y2 + · · · + Yn,
Z2 = X1 + Y3 + · · · + Yn, Z3 = X1 + X2 + Y4 + · · · + Yn, etc.

Let us suppose g ∈ C^3 with compact support and let W be a standard normal. Our first goal is to
show

|Eg(Sn/√n) − Eg(W)| → 0. (20.1)

We have

Eg(Sn/√n) − Eg(W) = Eg(Sn/√n) − Eg(Σ_{i=1}^n Yi/√n)
= Σ_{i=1}^n [ Eg((Xi + Zi)/√n) − Eg((Yi + Zi)/√n) ].

By Taylor’s theorem,

g((Xi + Zi)/√n) = g(Zi/√n) + g′(Zi/√n) Xi/√n + (1/2) g″(Zi/√n) Xi^2/n + Rn,

where |Rn| ≤ ‖g‴‖∞ |Xi|^3/n^{3/2}. Taking expectations and using the independence,

Eg((Xi + Zi)/√n) = Eg(Zi/√n) + 0 + (1/(2n)) Eg″(Zi/√n) E[Xi^2] + ERn.

We have a very similar expression for Eg((Yi + Zi)/√n). Since E[Xi^2] = E[Yi^2] = 1, taking the difference,

| Eg((Xi + Zi)/√n) − Eg((Yi + Zi)/√n) | ≤ ‖g‴‖∞ (E|Xi|^3 + E|Yi|^3)/n^{3/2}.

Summing over i from 1 to n, we have (20.1).

By approximating continuous functions with compact support by C^3 functions with compact support,
we have (20.1) for such g. Since E(Sn/√n)^2 = 1, the sequence Sn/√n is tight. So given ε there exists M
such that P(|Sn/√n| > M) < ε for all n. By taking M larger if necessary, we also have P(|W| > M) < ε.
Suppose g is bounded and continuous. Let ψ be a continuous function with compact support that is bounded
by one, is nonnegative, and that equals 1 on [−M, M]. By (20.1) applied to gψ,

|E(gψ)(Sn/√n) − E(gψ)(W)| → 0.

However,

|Eg(Sn/√n) − E(gψ)(Sn/√n)| ≤ ‖g‖∞ P(|Sn/√n| > M) < ε‖g‖∞,

and similarly

|Eg(W) − E(gψ)(W)| < ε‖g‖∞.

Since ε is arbitrary, this proves (20.1) for bounded continuous g. By Proposition 16.1, this proves our
proposition.



Proposition 20.3. Suppose for each n the r.v.s Xni, i = 1, . . . , n, are i.i.d. Bernoullis with parameter pn.
If npn → λ and Sn = Σ_{i=1}^n Xni, then Sn converges weakly to a Poisson r.v. with parameter λ.


Proof. We write

ϕSn(t) = [ϕXn1(t)]^n = [1 + pn(e^{it} − 1)]^n = [1 + (npn/n)(e^{it} − 1)]^n → e^{λ(e^{it}−1)}.

Now apply the continuity theorem.
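This convergence can be seen directly in the probability mass functions (an added numerical check, with λ = 2 and k = 3 chosen arbitrarily): P(Sn = k) with Bernoulli parameter pn = λ/n approaches the Poisson mass e^{−λ}λ^k/k!.

```python
from math import comb, exp, factorial

lam, k = 2.0, 3

def binom_pmf(n):
    # P(S_n = k) when the X_{ni} are Bernoulli(lambda/n)
    p = lam / n
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

poisson_pmf = exp(-lam) * lam ** k / factorial(k)
for n in (10, 100, 10000):
    print(n, round(binom_pmf(n), 6), round(poisson_pmf, 6))
```

The binomial mass approaches the Poisson mass as n grows, with error of order λ^2/n.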


A much more general theorem than Theorem 20.1 is the Lindeberg-Feller theorem.

Theorem 20.4. Suppose for each n, Xni, i = 1, . . . , n, are mean zero independent random variables. Suppose
(a) Σ_{i=1}^n E[Xni^2] → σ^2 > 0 and
(b) for each ε, Σ_{i=1}^n E[|Xni|^2; |Xni| > ε] → 0.
Let Sn = Σ_{i=1}^n Xni. Then Sn converges weakly to a normal r.v. with mean zero and variance σ^2.

Note nothing is said about independence of the Xni for different n.


Let us look at Theorem 20.1 in light of this theorem. Suppose the Yi are i.i.d. and let Xni = Yi/√n.
Then

Σ_{i=1}^n E(Yi/√n)^2 = E[Y1^2]

and

Σ_{i=1}^n E[|Xni|^2; |Xni| > ε] = n E[|Y1|^2/n; |Y1| > √n ε] = E[|Y1|^2; |Y1| > √n ε],

which tends to 0 by the dominated convergence theorem.

If the Yi are independent with mean 0, and

Σ_{i=1}^n E|Yi|^3 / (Var Sn)^{3/2} → 0,

then Sn/(Var Sn)^{1/2} converges weakly to a standard normal. This is known as Lyapounov’s theorem; we
leave the derivation of this from the Lindeberg-Feller theorem as an exercise for the reader.


Proof. Let ϕni be the characteristic function of Xni and let σni^2 be the variance of Xni. We need to show

Π_{i=1}^n ϕni(t) → e^{−t^2 σ^2/2}. (20.2)

Using Taylor series, |e^{ib} − 1 − ib + b^2/2| ≤ c|b|^3 for a constant c. Also,

|e^{ib} − 1 − ib + b^2/2| ≤ |e^{ib} − 1 − ib| + |b^2|/2 ≤ c|b|^2.

If we apply these bounds to the random variable tY and take expectations,

|ϕY(t) − (1 + itEY − t^2 EY^2/2)| ≤ cE[(t^2 Y^2) ∧ (|t|^3 |Y|^3)].

Applying this to Y = Xni, which has mean zero,

|ϕni(t) − (1 − t^2 σni^2/2)| ≤ cE[(t^2 |Xni|^2) ∧ (|t|^3 |Xni|^3)].

The right hand side is less than or equal to

cE[|t|^3 |Xni|^3; |Xni| ≤ ε] + cE[t^2 |Xni|^2; |Xni| > ε]
≤ cε|t|^3 E[|Xni|^2] + ct^2 E[|Xni|^2; |Xni| ≥ ε].

Summing over i we obtain

Σ_{i=1}^n |ϕni(t) − (1 − t^2 σni^2/2)| ≤ cε|t|^3 Σ E[|Xni|^2] + ct^2 Σ E[|Xni|^2; |Xni| ≥ ε].


We need the following inequality: if |ai|, |bi| ≤ 1, then

| Π_{i=1}^n ai − Π_{i=1}^n bi | ≤ Σ_{i=1}^n |ai − bi|.

To prove this, note

Π ai − Π bi = (an − bn) Π_{i<n} bi + an ( Π_{i<n} ai − Π_{i<n} bi )

and use induction.


Note |ϕni(t)| ≤ 1 and |1 − t^2 σni^2/2| ≤ 1, because σni^2 ≤ ε^2 + E[|Xni|^2; |Xni| > ε] < 1/t^2 if we take ε
small enough and n large enough. So

| Π_{i=1}^n ϕni(t) − Π_{i=1}^n (1 − t^2 σni^2/2) | ≤ cε|t|^3 Σ E[|Xni|^2] + ct^2 Σ E[|Xni|^2; |Xni| ≥ ε].

Since sup_i σni^2 → 0, log(1 − t^2 σni^2/2) is asymptotically equal to −t^2 σni^2/2, and so

Π (1 − t^2 σni^2/2) = exp( Σ log(1 − t^2 σni^2/2) )

is asymptotically equal to

exp( −t^2 Σ σni^2/2 ) = e^{−t^2 σ^2/2}.

Since ε is arbitrary, the proof is complete.


We now complete the proof of Theorem 8.2.

Proof of “only if” part of Theorem 8.2. Since Σ Xn converges, Xn must converge to zero a.s.,
and so P(|Xn| > A i.o.) = 0. By the Borel-Cantelli lemma, this says Σ P(|Xn| > A) < ∞. We also conclude
by Proposition 5.4 that Σ Yn converges.



Let cn = Σ_{i=1}^n Var Yi and suppose cn → ∞. Let Znm = (Ym − EYm)/√cn. Then Σ_{m=1}^n Var Znm =
(1/cn) Σ_{m=1}^n Var Ym = 1. If ε > 0, then for n large we have 2A/√cn < ε. Since |Ym| ≤ A and hence
|EYm| ≤ A, then |Znm| ≤ 2A/√cn < ε. It follows that Σ_{m=1}^n E[|Znm|^2; |Znm| > ε] = 0 for large n.
By Theorem 20.4, Σ_{m=1}^n (Ym − EYm)/√cn converges weakly to a standard normal. However, Σ_{m=1}^n Ym
converges, and cn → ∞, so Σ_{m=1}^n Ym/√cn must converge to 0. The quantities Σ_{m=1}^n EYm/√cn are
nonrandom, so there is no way the difference can converge to a standard normal, a contradiction. We conclude
cn does not converge to infinity.


Let Vi = Yi − EYi. Since |Vi| ≤ 2A, EVi = 0, and Var Vi = Var Yi, which is summable, by the “if”
part of the three series criterion, Σ Vi converges. Since Σ Yi converges, taking the difference shows Σ EYi
converges. This completes the proof.

21. Framework for Markov chains.


Suppose S is a set with some topological structure that we will use as our state space. Think of S
as being R^d or the positive integers, for example. A sequence of random variables X0, X1, . . . is a Markov
chain if

P(Xn+1 ∈ A | X0, . . . , Xn) = P(Xn+1 ∈ A | Xn) (21.1)

for all n and all measurable sets A. The definition of Markov chain has this intuition: to predict the
probability that Xn+1 is in any set, we only need to know where we currently are; how we got there gives
no additional information.


Let’s make some additional comments. First of all, we previously considered random variables as
mappings from Ω to R. Now we want to extend our definition by allowing a random variable to be a map X
from Ω to S, where (X ∈ A) is F measurable for all open sets A. This agrees with the definition of r.v. in
the case S = R.

Although there is quite a theory developed for Markov chains with arbitrary state spaces, we will
confine our attention to the case where either S is finite, in which case we will usually suppose S = {1, 2, . . . , n},
or countable and discrete, in which case we will usually suppose S is the set of positive integers.


We are going to further restrict our attention to Markov chains where

P(Xn+1 ∈ A | Xn = x) = P(X1 ∈ A | X0 = x),

that is, where the probabilities do not depend on n. Such Markov chains are said to have stationary transition
probabilities.

Define the initial distribution of a Markov chain with stationary transition probabilities by µ(i) =
P(X0 = i). Define the transition probabilities by p(i, j) = P(Xn+1 = j | Xn = i). Since the transition
probabilities are stationary, p(i, j) does not depend on n.

In this case we can use the definition of conditional probability given in undergraduate classes. If
P(Xn = i) = 0 for all n, that means we never visit i and we could drop the point i from the state space.


Proposition 21.1. Let X be a Markov chain with initial distribution µ and transition probabilities p(i, j).
Then

P(Xn = in, Xn−1 = in−1, . . . , X1 = i1, X0 = i0) = µ(i0) p(i0, i1) · · · p(in−1, in). (21.2)

Proof. We use induction on n. It is clearly true for n = 0 by the definition of µ(i). Suppose it holds for n;
we need to show it holds for n + 1. For simplicity, we will do the case n = 2. Then

P(X3 = i3, X2 = i2, X1 = i1, X0 = i0)
= E[P(X3 = i3 | X0 = i0, X1 = i1, X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= E[P(X3 = i3 | X2 = i2); X2 = i2, X1 = i1, X0 = i0]
= p(i2, i3) P(X2 = i2, X1 = i1, X0 = i0).

Now by the induction hypothesis,

P(X2 = i2, X1 = i1, X0 = i0) = p(i1, i2) p(i0, i1) µ(i0).

Substituting establishes the claim for n = 3.
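Formula (21.2) can be verified by brute force for a small chain (an added illustration with an arbitrary two-state example): summing µ(i0)p(i0, i1)p(i1, i2) over all paths ending at i2 = j must reproduce P(X2 = j), i.e., the vector µP^2.

```python
# A two-state chain with an arbitrary initial distribution and transition matrix.
mu = [0.3, 0.7]
p = [[0.9, 0.1],
     [0.4, 0.6]]

# P(X2 = j) via the path formula (21.2)
path = [0.0, 0.0]
for i0 in (0, 1):
    for i1 in (0, 1):
        for i2 in (0, 1):
            path[i2] += mu[i0] * p[i0][i1] * p[i1][i2]

# P(X2 = j) via two matrix-vector multiplications: mu P^2
muP = [sum(mu[i] * p[i][j] for i in (0, 1)) for j in (0, 1)]
muP2 = [sum(muP[i] * p[i][j] for i in (0, 1)) for j in (0, 1)]
print(path, muP2)
```

The two computations agree exactly, and the path probabilities sum to 1.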



Proposition 21.2. Suppose µ(i) is a sequence of nonnegative numbers with Σi µ(i) = 1, and for each i
the sequence p(i, j) is nonnegative and sums to 1. Then there exists a Markov chain with µ(i) as its initial
distribution and p(i, j) as the transition probabilities.


Proof. Define Ω = S^∞. Let F be the σ-field generated by the collection of sets {(i0, i1, . . . , in) : n >
0, ij ∈ S}. An element ω of Ω is a sequence (i0, i1, . . .). Define Xj(ω) = ij if ω = (i0, i1, . . .). Define
P(X0 = i0, . . . , Xn = in) by (21.2). Using the Kolmogorov extension theorem, one can show that P can be
extended to a probability on Ω.

The above framework is rather abstract, but it is clear that under P the sequence Xn has initial
distribution µ(i); what we need to show is that Xn is a Markov chain and that

P(Xn+1 = in+1 | X0 = i0, . . . , Xn = in) = P(Xn+1 = in+1 | Xn = in) = p(in, in+1). (21.3)

By the definition of conditional probability, the left hand side of (21.3) is

P(Xn+1 = in+1 | X0 = i0, . . . , Xn = in)
= P(Xn+1 = in+1, Xn = in, . . . , X0 = i0) / P(Xn = in, . . . , X0 = i0) (21.4)
= [µ(i0) · · · p(in−1, in) p(in, in+1)] / [µ(i0) · · · p(in−1, in)]
= p(in, in+1),

as desired.

To complete the proof we need to show

P(Xn+1 = in+1, Xn = in) / P(Xn = in) = p(in, in+1),

or

P(Xn+1 = in+1, Xn = in) = p(in, in+1) P(Xn = in). (21.5)

Now

P(Xn = in) = Σ_{i0,...,in−1} P(Xn = in, Xn−1 = in−1, . . . , X0 = i0) = Σ_{i0,...,in−1} µ(i0) · · · p(in−1, in),

and similarly

P(Xn+1 = in+1, Xn = in) = Σ_{i0,...,in−1} P(Xn+1 = in+1, Xn = in, Xn−1 = in−1, . . . , X0 = i0)
= p(in, in+1) Σ_{i0,...,in−1} µ(i0) · · · p(in−1, in).

Equation (21.5) now follows.


Note in this construction that the Xn sequence is fixed and does not depend on µ or p. Let p(i, j) be
fixed. The probability we constructed above is often denoted Pµ. If µ is point mass at a point i or x, it is
denoted Pi or Px. So we have one probability space, one sequence Xn, but a whole family of probabilities
Px, one for each starting point x.

Later on we will see that this framework allows one to express the Markov property and strong
Markov property in a convenient way. As part of the preparation for doing this, we define the shift operators
θk : Ω → Ω by

θk(i0, i1, . . .) = (ik, ik+1, . . .).

Then Xj ◦ θk = Xj+k. To see this, if ω = (i0, i1, . . .), then

Xj ◦ θk(ω) = Xj(ik, ik+1, . . .) = ij+k = Xj+k(ω).


22. Examples.

Random walk on the integers

We let Yi be an i.i.d. sequence of r.v.’s, with p = P(Yi = 1) and 1 − p = P(Yi = −1). Let
Xn = X0 + Σ_{i=1}^n Yi. Then the Xn can be viewed as a Markov chain with p(i, i + 1) = p, p(i, i − 1) = 1 − p,
and p(i, j) = 0 if |j − i| ≠ 1. More general random walks on the integers also fit into this framework. To
check that the random walk is Markov,

P(Xn+1 = in+1 | X0 = i0, . . . , Xn = in) = P(Xn+1 − Xn = in+1 − in | X0 = i0, . . . , Xn = in)
= P(Xn+1 − Xn = in+1 − in),

using the independence, while

P(Xn+1 = in+1 | Xn = in) = P(Xn+1 − Xn = in+1 − in | Xn = in) = P(Xn+1 − Xn = in+1 − in).


Random walks on graphs

Suppose we have n points, and from each point there is some probability of going to another point.
For example, suppose there are 5 points and we have p(1,2) = 1/2, p(1,3) = 1/2, p(2,1) = 1/4, p(2,3) = 1/2,
p(2,5) = 1/4, p(3,1) = 1/4, p(3,2) = 1/4, p(3,3) = 1/2, p(4,1) = 1, p(5,1) = 1/2, p(5,5) = 1/2. The p(i, j) are often
arranged into a matrix:

        ⎛  0    1/2   1/2    0     0  ⎞
        ⎜ 1/4    0    1/2    0    1/4 ⎟
    P = ⎜ 1/4   1/4   1/2    0     0  ⎟
        ⎜  1     0     0     0     0  ⎟
        ⎝ 1/2    0     0     0    1/2 ⎠


Note the rows must sum to 1, since

Σ_{j=1}^5 p(i, j) = Σ_{j=1}^5 P(X1 = j | X0 = i) = P(X1 ∈ S | X0 = i) = 1.
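The example above is easy to check and simulate in code (an added illustration; the five states are relabeled 0 through 4).

```python
import random

random.seed(1)
# Transition matrix of the 5-point example, states relabeled 0, ..., 4.
P = [[0,   1/2, 1/2, 0, 0],
     [1/4, 0,   1/2, 0, 1/4],
     [1/4, 1/4, 1/2, 0, 0],
     [1,   0,   0,   0, 0],
     [1/2, 0,   0,   0, 1/2]]

row_sums = [sum(row) for row in P]  # each row must sum to 1
print(row_sums)

# Simulate ten steps of the chain started at state 0,
# sampling each step from the current row of P.
x, visited = 0, [0]
for _ in range(10):
    x = random.choices(range(5), weights=P[x])[0]
    visited.append(x)
print(visited)
```

Every simulated transition uses only entries of P that are strictly positive.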


Renewal processes

Let Yi be i.i.d. with P(Yi = k) = ak, where the ak are nonnegative and sum to 1. Let T0 = i0 and
Tn = T0 + Σ_{i=1}^n Yi. We think of Yi as the lifetime of the nth light bulb and Tn as the time when the nth
light bulb burns out. Let Xn = min{Tm − n : Tm ≥ n}.

So Xn is the amount of time after time n until the current light bulb burns out.

If Xn = j and j > 0, then Ti = n + j for some i but Ti does not equal n, n + 1, . . . , n + j − 1 for any
i. So Ti = (n + 1) + (j − 1) for some i and Ti does not equal (n + 1), (n + 1) + 1, . . . , (n + 1) + (j − 2) for
any i. Therefore Xn+1 = j − 1. So p(i, i − 1) = 1 if i ≥ 1.

If Xn = 0, then a light bulb burned out at time n, and Xn+1 is 0 if the next light bulb burns out
immediately and j − 1 if the next light bulb has lifetime j. The probability of this is aj. So p(0, j) = aj+1. All
the other p(i, j)’s are 0.


Branching processes

Consider k particles. At the next time interval, some of them die, and some of them split into several
particles. The probability that a given particle will split into j particles is given by aj, j = 0, 1, . . ., where
the aj are nonnegative and sum to 1. The behavior of each particle is independent of the behavior of all
the other particles. If Xn is the number of particles at time n, then Xn is a Markov chain. Let Yi be i.i.d.
random variables with P(Yi = j) = aj. The p(i, j) for Xn are somewhat complicated, and can be defined by
p(i, j) = P(Σ_{m=1}^i Ym = j).
Queues

We will discuss briefly the M/G/1 queue. The M refers to the fact that the customers arrive according
to a Poisson process, so the probability that the number of customers arriving in a time interval of length t
is k is given by e^{−λt}(λt)^k/k!. The G refers to the fact that the length of time it takes to serve a customer is
given by a distribution that is not necessarily exponential. The 1 refers to the fact that there is 1 server.

Suppose the length of time to serve one customer has distribution function F with density f. The
probability that k customers arrive during the time it takes to serve one customer is

ak = ∫_0^∞ e^{−λt} ((λt)^k/k!) f(t) dt.

Let the Yi be i.i.d. with P(Yi = k − 1) = ak. So Yi + 1 is the number of customers arriving during the time
it takes to serve one customer. Let Xn+1 = (Xn + Yn+1)^+ be the number of customers waiting. Then Xn is
a Markov chain with p(0, 0) = a0 + a1, p(0, k − 1) = ak for k ≥ 2, and p(i, i + k − 1) = ak for i ≥ 1, k ≥ 0.
Ehrenfest urns

Suppose we have two urns with a total of r balls, k in one and r − k in the other. Pick one of the
r balls at random and move it to the other urn. Let Xn be the number of balls in the first urn. Xn is a
Markov chain with p(k, k + 1) = (r − k)/r, p(k, k − 1) = k/r, and p(i, j) = 0 otherwise.

One model for this is to consider two containers of air with a thin tube connecting them. Suppose
a few molecules of a foreign substance are introduced. Then the number of molecules in the first container
is like an Ehrenfest urn. We shall see that all states in this model are recurrent, so infinitely often all the
molecules of the foreign substance will be in the first urn. Yet there is a tendency toward equilibrium, so
on average there will be about the same number of molecules in each container for all large times.


Birth and death processes

Suppose there are i particles, and the probability of a birth is ai, the probability of a death is bi,
where ai, bi ≥ 0 and ai + bi ≤ 1. Setting Xn equal to the number of particles, Xn is a Markov chain with
p(i, i + 1) = ai, p(i, i − 1) = bi, and p(i, i) = 1 − ai − bi.


23. Markov properties.

A special case of the Markov property says that

Ex[f(Xn+1) | Fn] = EXn[f(X1)].

The right hand side is to be interpreted as ϕ(Xn), where ϕ(y) = Ey f(X1). The randomness on the right
hand side all comes from the Xn. If we write f(Xn+1) = f(X1) ◦ θn and we write Y for f(X1), then the
above can be rewritten

Ex[Y ◦ θn | Fn] = EXn Y.

Let F∞ be the σ-field generated by ∪_{n=1}^∞ Fn.


Theorem 23.1. (Markov property) If Y is bounded and measurable with respect to F∞, then

Ex[Y ◦ θn | Fn] = EXn[Y], P-a.s.

for each n and x.


Proof. If we can prove this for Y = f1(X1) · · · fm(Xm), then taking fj(x) = 1_{ij}(x), we will have it for Y’s
of the form 1_{(X1=i1,...,Xm=im)}. By linearity (and the fact that S is countable), we will then have it for Y’s
of the form 1_{((X1,...,Xm)∈B)}. A monotone class argument shows that such Y’s generate F∞.

We use induction on m, and first we prove it for m = 1. We need to show

E[f1(X1) ◦ θn | Fn] = EXn f1(X1).

Using linearity and the fact that S is countable, it suffices to show this for f1(y) = 1_{j}(y). Using the
definition of θn, we need to show

P(Xn+1 = j | Fn) = PXn(X1 = j),

or equivalently,

Px(Xn+1 = j; A) = Ex[PXn(X1 = j); A] (23.2)

when A ∈ Fn. By linearity it suffices to consider A of the form A = (X1 = i1, . . . , Xn = in). The left hand
side of (23.2) is then

Px(Xn+1 = j, X1 = i1, . . . , Xn = in),

and by (21.4) this is equal to

p(in, j) Px(X1 = i1, . . . , Xn = in) = p(in, j) Px(A).


Let g(y) = Py(X1 = j). We have

Px(X1 = j, X0 = k) = Pk(X1 = j) if x = k, and 0 if x ≠ k,

while

Ex[g(X0); X0 = k] = Ex[g(k); X0 = k] = Pk(X1 = j) Px(X0 = k) = Pk(X1 = j) if x = k, and 0 if x ≠ k.

It follows that

p(i, j) = Px(X1 = j | X0 = i) = Pi(X1 = j).

So the right hand side of (23.2) is

Ex[g(Xn); A] = g(in) Px(A) = p(in, j) Px(A),

as required.


Suppose the result holds for m and we want to show it holds for m + 1. We have

Ex[f1(Xn+1) · · · fm+1(Xn+m+1) | Fn]
= Ex[ Ex[fm+1(Xn+m+1) | Fn+m] f1(Xn+1) · · · fm(Xn+m) | Fn ]
= Ex[ EXn+m[fm+1(X1)] f1(Xn+1) · · · fm(Xn+m) | Fn ]
= Ex[ f1(Xn+1) · · · fm−1(Xn+m−1) h(Xn+m) | Fn ].

Here we used the result for m = 1 and we defined h(y) = fm(y) g(y), where g(y) = Ey[fm+1(X1)]. Using
the induction hypothesis, this is equal to

EXn[f1(X1) · · · fm−1(Xm−1) h(Xm)] = EXn[f1(X1) · · · fm(Xm) EXm fm+1(X1)]
= EXn[f1(X1) · · · fm(Xm) E[fm+1(Xm+1) | Fm]]
= EXn[f1(X1) · · · fm+1(Xm+1)],

which is what we needed.


Define θN(ω) = (θ_{N(ω)})(ω). The strong Markov property is the same as the Markov property, but
where the fixed time n is replaced by a stopping time N.

Theorem 23.2. If Y is bounded and measurable and N is a finite stopping time, then

Ex[Y ◦ θN | FN] = EXN[Y].

Proof. We will show

Px(XN+1 = j | FN) = PXN(X1 = j).

Once we have this, we can proceed as in the proof of Theorem 23.1 to obtain our result. To show the
above equality, we need to show that if B ∈ FN, then

Px(XN+1 = j, B) = Ex[PXN(X1 = j); B]. (23.3)

Recall that since B ∈ FN, then B ∩ (N = k) ∈ Fk. We have

Px(XN+1 = j, B, N = k) = Px(Xk+1 = j, B, N = k)
= Ex[Px(Xk+1 = j | Fk); B, N = k]
= Ex[PXk(X1 = j); B, N = k]
= Ex[PXN(X1 = j); B, N = k].

Now sum over k; since N is finite, we obtain our desired result.


Another way of expressing the Markov property is through the Chapman-Kolmogorov equations. Let
p^n(i, j) = P(Xn = j | X0 = i).

Proposition 23.3. For all i, j, m, n we have

p^{n+m}(i, j) = Σ_{k∈S} p^n(i, k) p^m(k, j).

Proof. We write

P(Xn+m = j, X0 = i) = Σk P(Xn+m = j, Xn = k, X0 = i)
= Σk P(Xn+m = j | Xn = k, X0 = i) P(Xn = k | X0 = i) P(X0 = i)
= Σk P(Xn+m = j | Xn = k) p^n(i, k) P(X0 = i)
= Σk p^m(k, j) p^n(i, k) P(X0 = i).

If we divide both sides by P(X0 = i), we have our result.

Note the resemblance to matrix multiplication. It is clear that if P is the matrix made up of the p(i, j),
then P^n will be the matrix whose (i, j) entry is p^n(i, j).
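In code (an added check with an arbitrary two-state matrix), the Chapman-Kolmogorov equations say that powers of P computed by different groupings agree, e.g. P^2 · P^3 = P^3 · P^2 = P^5.

```python
def matmul(A, B):
    # (i, j) entry of AB is sum_k A[i][k] B[k][j],
    # exactly the Chapman-Kolmogorov sum over intermediate states k
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = [[0.9, 0.1],
     [0.4, 0.6]]
P2 = matmul(P, P)
P3 = matmul(P2, P)

# p^{2+3}(i,j) = sum_k p^2(i,k) p^3(k,j) = sum_k p^3(i,k) p^2(k,j)
A = matmul(P2, P3)
B = matmul(P3, P2)
print(A)
print(B)
```

Both orderings give the same stochastic matrix P^5, with rows summing to 1.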



24. Recurrence and transience.

Let

Ty = min{i > 0 : Xi = y}.

This is the first time that Xi hits the point y. Even if X0 = y we would have Ty > 0. We let Ty^k be the k-th
time that the Markov chain hits y, and we set

r(x, y) = Px(Ty < ∞),

the probability starting at x that the Markov chain ever hits y.
Proposition 24.1. Px(Ty^k < ∞) = r(x, y) r(y, y)^{k−1}.

Proof. The case k = 1 is just the definition, so suppose k > 1. Using the strong Markov property,

Px(Ty^k < ∞) = Px(Ty ◦ θ_{Ty^{k−1}} < ∞, Ty^{k−1} < ∞)
= Ex[Px(Ty ◦ θ_{Ty^{k−1}} < ∞ | F_{Ty^{k−1}}); Ty^{k−1} < ∞]
= Ex[P^{X(Ty^{k−1})}(Ty < ∞); Ty^{k−1} < ∞]
= Ex[Py(Ty < ∞); Ty^{k−1} < ∞]
= r(y, y) Px(Ty^{k−1} < ∞).

We used here the fact that at time Ty^{k−1} the Markov chain must be at the point y. Repeating this argument
k − 2 times yields the result.


We say that y is recurrent if r(y, y) = 1; otherwise we say y is transient. Let

N(y) = Σ_{n=1}^∞ 1_{(Xn=y)}.

Proposition 24.2. y is recurrent if and only if EyN(y) = ∞.

Proof. Note

EyN(y) = Σ_{k=1}^∞ Py(N(y) ≥ k) = Σ_{k=1}^∞ Py(Ty^k < ∞) = Σ_{k=1}^∞ r(y, y)^k.

We used the fact that N(y) is the number of visits to y, and that the number of visits being at least k is the
same as the time of the k-th visit being finite. Since r(y, y) ≤ 1, the left hand side will be finite if and only
if r(y, y) < 1.


Observe that

EyN(y) = Σn Py(Xn = y) = Σn p^n(y, y).

If we consider simple symmetric random walk on the integers, then p^n(0, 0) is 0 if n is odd and equal
to C(n, n/2) 2^{−n} if n is even. This is because in order to be at 0 after n steps, the walk must have had n/2
positive steps and n/2 negative steps; the probability of this is given by the binomial distribution. Using
Stirling’s approximation, we see that p^n(0, 0) ∼ c/√n for n even, so the sum Σn p^n(0, 0) diverges, and simple
random walk in one dimension is recurrent.


Similar arguments show that simple symmetric random walk is also recurrent in 2 dimensions but
transient in 3 or more dimensions.


Proposition 24.3. If x is recurrent and r(x, y) > 0, then y is recurrent and r(y, x) = 1.

Proof. First we show r(y, x) = 1. Suppose not. Since r(x, y) > 0, there is a smallest n and y1, . . . , yn−1
such that p(x, y1) p(y1, y2) · · · p(yn−1, y) > 0. Since this is the smallest n, none of the yi can equal x. Then

Px(Tx = ∞) ≥ p(x, y1) · · · p(yn−1, y)(1 − r(y, x)) > 0,

a contradiction to x being recurrent.

Next we show that y is recurrent. Since r(y, x) > 0 and r(x, y) > 0, there exist L and K such that
p^L(y, x) > 0 and p^K(x, y) > 0. Then

p^{L+n+K}(y, y) ≥ p^L(y, x) p^n(x, x) p^K(x, y).

Summing over n,

Σn p^{L+n+K}(y, y) ≥ p^L(y, x) p^K(x, y) Σn p^n(x, x) = ∞.


We say a subset C of S is closed if x ∈ C and r(x, y) > 0 implies y ∈ C. A subset D is irreducible if
x, y ∈ D implies r(x, y) > 0.

Proposition 24.4. Let C be finite and closed. Then C contains a recurrent state.

From the preceding proposition, if C is also irreducible, then all states in C will be recurrent.

Proof. If not, for all y we have r(y, y) < 1 and

ExN(y) = Σ_{k=1}^∞ r(x, y) r(y, y)^{k−1} = r(x, y)/(1 − r(y, y)) < ∞.

Since C is finite, Σy ExN(y) < ∞. But that is a contradiction, since

Σy ExN(y) = Σy Σn p^n(x, y) = Σn Σy p^n(x, y) = Σn Px(Xn ∈ C) = Σn 1 = ∞,

where we used the fact that C is closed, so that Px(Xn ∈ C) = 1 for x ∈ C.

Theorem 24.5. Let R = {x : r(x, x) = 1}, the set of recurrent states. Then R = ∪_{i=1}^∞ Ri, where each Ri is
closed and irreducible.

Proof. Say x ∼ y if r(x, y) > 0. Since every state is recurrent, x ∼ x, and if x ∼ y, then y ∼ x. If x ∼ y and
y ∼ z, then p^n(x, y) > 0 and p^m(y, z) > 0 for some n and m. Then p^{n+m}(x, z) > 0, or x ∼ z. Therefore we
have an equivalence relation, and we let the Ri be the equivalence classes.

Looking at our examples, it is easy to see that in the Ehrenfest urn model all states are recurrent. For
the branching process model, suppose p(x, 0) > 0 for all x. Then 0 is recurrent and all the other states are
transient. In the renewal chain, there are two cases. If {k : ak > 0} is unbounded, all states are recurrent.
If K = max{k : ak > 0}, then {0, 1, . . . , K − 1} are recurrent states and the rest are transient.

For the queueing model, let µ = Σ k ak, the expected number of people arriving during one customer’s
service time. We may view this as a branching process by letting all the customers arriving during one
person’s service time be considered the progeny of that customer. It turns out that if µ ≤ 1, 0 is recurrent
and all other states are also. If µ > 1, all states are transient.


25. Stationary measures.

A probability µ is a stationary distribution if

Σx µ(x) p(x, y) = µ(y). (25.1)

In matrix notation this is µP = µ, or µ is the left eigenvector corresponding to the eigenvalue 1. In the case of
a stationary distribution, Pµ(X1 = y) = µ(y), which implies that X1, X2, . . . all have the same distribution.
We can use (25.1) when µ is a measure rather than a probability, in which case it is called a stationary
measure.

If we have a random walk on the integers, µ(x) = 1 for all x serves as a stationary measure. In the
case of an asymmetric random walk: p(i, i + 1) = p, p(i, i − 1) = q = 1 − p and p ≠ q, setting µ(x) = (p/q)^x
also works.


In the Ehrenfest urn model, µ(x) = 2^{−r} C(r, x) works. One way to see this is that µ is the distribution
one gets if one flips r coins and puts a coin in the first urn when the coin is heads. A transition corresponds
to picking a coin at random and turning it over.
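This can be confirmed directly (an added check, with r = 6 chosen arbitrarily): the binomial weights µ(x) = 2^{−r} C(r, x) satisfy µP = µ for the Ehrenfest transition probabilities.

```python
from math import comb

r = 6
mu = [comb(r, k) / 2 ** r for k in range(r + 1)]

def p(i, j):
    # Ehrenfest urn: p(k, k+1) = (r-k)/r, p(k, k-1) = k/r
    if j == i + 1:
        return (r - i) / r
    if j == i - 1:
        return i / r
    return 0.0

muP = [sum(mu[i] * p(i, j) for i in range(r + 1)) for j in range(r + 1)]
print(mu)
print(muP)
```

The vector µP reproduces µ entry by entry.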


Proposition 25.1. Let a be recurrent and let T = Ta. Set

µ(y) = Ea Σ_{n=0}^{T−1} 1_{(Xn=y)}.

Then µ is a stationary measure.

The idea of the proof is that µ(y) is the expected number of visits to y by the sequence X0, . . . , XT−1,
while µP is the expected number of visits to y by X1, . . . , XT. These should be the same because XT =
X0 = a.


Proof. First, let p_n(a, y) = Pa(Xn = y, T > n). So

µ(y) = Σ_{n=0}^∞ Pa(Xn = y, T > n) = Σ_{n=0}^∞ p_n(a, y),

and

Σy µ(y) p(y, z) = Σy Σ_{n=0}^∞ p_n(a, y) p(y, z).


Second, we consider the case z ≠ a. Then

Σy p_n(a, y) p(y, z) = Σy Pa(hit y in n steps without first hitting a and then go to z in one step)
= p_{n+1}(a, z).

So

Σy µ(y) p(y, z) = Σn Σy p_n(a, y) p(y, z) = Σ_{n=0}^∞ p_{n+1}(a, z) = Σ_{n=0}^∞ p_n(a, z) = µ(z),

since p_0(a, z) = 0.


Third, we consider the case z = a. Then

Σy p_n(a, y) p(y, z) = Σy Pa(hit y in n steps without first hitting a and then go to z in one step)
= Pa(T = n + 1).

Recall Pa(T = 0) = 0, and since a is recurrent, T < ∞. So

Σy µ(y) p(y, z) = Σn Σy p_n(a, y) p(y, z) = Σ_{n=0}^∞ Pa(T = n + 1) = Σ_{n=0}^∞ Pa(T = n) = 1.

On the other hand,

Σ_{n=0}^{T−1} 1_{(Xn=a)} = 1_{(X0=a)} = 1,

hence µ(a) = 1. Therefore, whether z ≠ a or z = a, we have µP(z) = µ(z).


Finally, we show µ(y) < ∞. If r(a, y) = 0, then µ(y) = 0. If r(a, y) > 0, then r(y, a) > 0, so we may
choose n so that p^n(y, a) > 0, and then

1 = µ(a) = Σx µ(x) p^n(x, a) ≥ µ(y) p^n(y, a),

which implies µ(y) < ∞.



Proposition 25.2. If the Markov chain is irreducible and all states are recurrent, then the stationary
measure is unique up to a constant multiple.

Proof. Fix a ∈ S. Let µa be the stationary measure constructed above and let ν be any other stationary
measure.


Since ν = νP,

ν(z) = ν(a) p(a, z) + Σ_{y≠a} ν(y) p(y, z)
= ν(a) p(a, z) + Σ_{y≠a} ν(a) p(a, y) p(y, z) + Σ_{x≠a} Σ_{y≠a} ν(x) p(x, y) p(y, z)
= ν(a) Pa(X1 = z) + ν(a) Pa(X1 ≠ a, X2 = z) + Pν(X0 ≠ a, X1 ≠ a, X2 = z).

Continuing,

ν(z) = ν(a) Σ_{m=1}^n Pa(X1 ≠ a, X2 ≠ a, . . . , Xm−1 ≠ a, Xm = z)
+ Pν(X0 ≠ a, X1 ≠ a, . . . , Xn−1 ≠ a, Xn = z)
≥ ν(a) Σ_{m=1}^n Pa(X1 ≠ a, X2 ≠ a, . . . , Xm−1 ≠ a, Xm = z).

Letting n → ∞, we obtain

ν(z) ≥ ν(a) µa(z).

We have

ν(a) = Σx ν(x) p^n(x, a) ≥ ν(a) Σx µa(x) p^n(x, a) = ν(a) µa(a) = ν(a),

since µa(a) = 1 (see the proof of Proposition 25.1). This means that we have equality, and so

ν(x) = ν(a) µa(x)

whenever p^n(x, a) > 0. Since r(x, a) > 0, this happens for some n. Consequently

ν(x)/ν(a) = µa(x).


Proposition 25.3. If a stationary distribution exists, then µ(y) > 0 implies y is recurrent.

Proof. If µ(y) > 0, then

∞ = Σ_{n=1}^∞ µ(y) = Σ_{n=1}^∞ Σ_x µ(x)p_n(x, y) = Σ_x µ(x) Σ_{n=1}^∞ p_n(x, y)
= Σ_x µ(x) Σ_{n=1}^∞ P^x(X_n = y) = Σ_x µ(x)E^x N(y)
= Σ_x µ(x)r(x, y)[1 + r(y, y) + r(y, y)^2 + ···].

Since r(x, y) ≤ 1 and µ is a probability measure, this is less than

Σ_x µ(x)(1 + r(y, y) + ···) ≤ 1 + r(y, y) + ···.

Hence r(y, y) must equal 1.
Proposition 25.4. If the Markov chain is irreducible and has stationary distribution µ, then

µ(x) = 1/(E^x T_x).

Proof. µ(x) > 0 for some x. If y ∈ S, then r(x, y) > 0 and so p_n(x, y) > 0 for some n. Hence

µ(y) = Σ_x µ(x)p_n(x, y) > 0.

Hence by Proposition 25.3, all states are recurrent. By the uniqueness of the stationary measure, µ_x is a constant multiple of µ, i.e., µ_x = cµ. Recall

µ_x(y) = Σ_{n=0}^∞ P^x(X_n = y, T_x > n),

and so

Σ_y µ_x(y) = Σ_y Σ_{n=0}^∞ P^x(X_n = y, T_x > n) = Σ_n Σ_y P^x(X_n = y, T_x > n) = Σ_n P^x(T_x > n) = E^x T_x.

Thus c = E^x T_x. Recalling that µ_x(x) = 1,

µ(x) = µ_x(x)/c = 1/(E^x T_x).
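Proposition 25.4 is easy to test by simulation. The following sketch (again on an example chain of my own choosing) estimates E^0 T_0 by Monte Carlo and compares it with 1/π(0), where π is obtained by power iteration:

```python
import random

# Sketch of mu(x) = 1 / E^x T_x on a small example chain of my choosing.
random.seed(7)
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]

def step(x):
    # sample the next state from row P[x]
    u, acc = random.random(), 0.0
    for y, p in enumerate(P[x]):
        acc += p
        if u < acc:
            return y
    return len(P) - 1

# stationary distribution by power iteration
pi = [1 / 3] * 3
for _ in range(500):
    pi = [sum(pi[x] * P[x][y] for x in range(3)) for y in range(3)]

# Monte Carlo estimate of the expected return time E^0 T_0
trials, total = 20000, 0
for _ in range(trials):
    x, t = step(0), 1
    while x != 0:
        x, t = step(x), t + 1
    total += t
mean_return = total / trials               # should be close to 1 / pi[0]
```

With these particular transition probabilities the product mean_return · pi[0] comes out close to 1, as the proposition predicts.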


We make the following distinction for recurrent states. If E^x T_x < ∞, then x is said to be positive recurrent. If x is recurrent but E^x T_x = ∞, x is null recurrent.

Proposition 25.5. Suppose a chain is irreducible.
(a) If there exists a positive recurrent state, then there is a stationary distribution.
(b) If there is a stationary distribution, all states are positive recurrent.
(c) If there exists a transient state, all states are transient.
(d) If there exists a null recurrent state, all states are null recurrent.

Proof. To show (a), if x is positive recurrent, then there exists a stationary measure µ_x with µ_x(x) = 1. Then µ(y) = µ_x(y)/(E^x T_x) will be a stationary distribution.

For (b), suppose µ(x) > 0 for some x. We showed this implies µ(y) > 0 for all y. Then 0 < µ(y) = 1/(E^y T_y), which implies E^y T_y < ∞.

We showed that if x is recurrent and r(x, y) > 0, then y is recurrent. So (c) follows.

Suppose there exists a null recurrent state. If there exists a positive recurrent or transient state as well, then by (a) and (b) or by (c) all states are positive recurrent or transient, a contradiction, and (d) follows.


26. Convergence.

Our goal is to show that under certain conditions p_n(x, y) → π(y), where π is the stationary distribution. (In the null recurrent case p_n(x, y) → 0.)

Consider a random walk on the set {0, 1}, where with probability one on each step the chain moves to the other state. Then p_n(x, y) = 0 if x ≠ y and n is even. A less trivial case is the simple random walk on the integers. We need to eliminate this periodicity.

Suppose x is recurrent, let I_x = {n ≥ 1 : p_n(x, x) > 0}, and let d_x be the g.c.d. (greatest common divisor) of I_x.
Proposition 26.1. If r(x, y) > 0, then d_y = d_x.

Proof. Since x is recurrent, r(y, x) > 0. Choose K and L such that p_K(x, y), p_L(y, x) > 0. Then

p_{K+L+n}(y, y) ≥ p_L(y, x)p_n(x, x)p_K(x, y),

so taking n = 0, we have p_{K+L}(y, y) > 0, so d_y divides K + L. Then d_y divides K + L + n whenever p_n(x, x) > 0, and hence d_y divides every n in I_x. Hence d_y divides d_x. By symmetry d_x divides d_y.


Proposition 26.2. If d_x = 1, there exists m_0 such that p_m(x, x) > 0 whenever m ≥ m_0.

Proof. First of all, I_x is closed under addition: if m, n ∈ I_x,

p_{m+n}(x, x) ≥ p_m(x, x)p_n(x, x) > 0.

Secondly, if there exists N such that N, N + 1 ∈ I_x, let m_0 = N^2. If m ≥ m_0, then m − N^2 = kN + r for some r < N, and

m = r + N^2 + kN = r(N + 1) + (N − r + k)N ∈ I_x.

Third, pick n_0 ∈ I_x and k > 0 such that n_0 + k ∈ I_x. If k = 1, we are done. Since d_x = 1, there exists n_1 ∈ I_x such that k does not divide n_1. We have n_1 = mk + r for some 0 < r < k. Note (m + 1)(n_0 + k) ∈ I_x and (m + 1)n_0 + n_1 ∈ I_x. The difference between these two numbers is (m + 1)k − n_1 = k − r < k. So now we have two numbers in I_x differing by at most k − 1. Repeating at most k times, we get two numbers in I_x differing by at most 1, and we are done.
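A small concrete illustration (the four-state chain below is my own example, not from the text): state 0 has loops of lengths 2 (0 → 1 → 0) and 3 (0 → 2 → 3 → 0), so d_0 = gcd(I_0) = 1.

```python
from math import gcd

# Sketch: compute I_0 and the period d_0 for a chain of my own design
# with return loops of lengths 2 and 3 at state 0.
adj = {0: [1, 2], 1: [0], 2: [3], 3: [0]}  # arrows with positive probability

I0 = []                                    # I_0 = {n >= 1 : p_n(0,0) > 0}
reach = {0}                                # states reachable in n steps
for n in range(1, 40):
    reach = {y for x in reach for y in adj[x]}
    if 0 in reach:
        I0.append(n)

d = 0
for n in I0:
    d = gcd(d, n)                          # the period d_0 = gcd of I_0
```

Consistent with Proposition 26.2, for this chain every m ≥ 2 already lies in I_0 (in general positivity is only guaranteed from some m_0 on).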


We write d for d_x. A chain is aperiodic if d = 1.

If d > 1, we say x ∼ y if p_{kd}(x, y) > 0 for some k > 0. We divide S into equivalence classes S_1, ..., S_d. Every d steps the chain started in S_i is back in S_i. So we look at p′ = p_d on S_i.


Theorem 26.3. Suppose the chain is irreducible, aperiodic, and has a stationary distribution π. Then p_n(x, y) → π(y) as n → ∞.

Proof. The idea is to take two copies of the chain with different starting distributions, let them run independently until they couple, i.e., hit each other, and then have them move together. So define

q((x_1, y_1), (x_2, y_2)) =
p(x_1, x_2)p(y_1, y_2)   if x_1 ≠ y_1,
p(x_1, x_2)              if x_1 = y_1, x_2 = y_2,
0                        otherwise.

Let Z_n = (X_n, Y_n) and T = min{i : X_i = Y_i}. We have

P(X_n = y) = P(X_n = y, T ≤ n) + P(X_n = y, T > n)
= P(Y_n = y, T ≤ n) + P(X_n = y, T > n),

while

P(Y_n = y) = P(Y_n = y, T ≤ n) + P(Y_n = y, T > n).

Subtracting,

P(X_n = y) − P(Y_n = y) ≤ P(X_n = y, T > n) − P(Y_n = y, T > n) ≤ P(T > n).

Using symmetry,

|P(X_n = y) − P(Y_n = y)| ≤ P(T > n).

Suppose we let Y_0 have distribution π and X_0 = x. Then

|p_n(x, y) − π(y)| ≤ P(T > n).

It remains to show P(T > n) → 0. To do this, consider another chain Z′_n = (X_n, Y_n), where now we take X_n, Y_n independent. Define

r((x_1, y_1), (x_2, y_2)) = p(x_1, x_2)p(y_1, y_2).

The chain under the transition probabilities r is irreducible. To see this, there exist K and L such that p_K(x_1, x_2) > 0 and p_L(y_1, y_2) > 0. If M is large, p_{L+M}(x_2, x_2) > 0 and p_{K+M}(y_2, y_2) > 0 (here aperiodicity is used, via Proposition 26.2). So p_{K+L+M}(x_1, x_2) > 0 and p_{K+L+M}(y_1, y_2) > 0, and hence r_{K+L+M}((x_1, y_1), (x_2, y_2)) > 0.

It is easy to check that π′(a, b) = π(a)π(b) is a stationary distribution for Z′. Hence Z′_n is recurrent, and hence it will hit (x, x); hence the time to hit the diagonal {(y, y) : y ∈ S} is finite. However, the distribution of the time to hit the diagonal is the same as that of T.
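The coupling can be watched in simulation. In the sketch below (the chain is my running example; its stationary distribution π was solved by hand from πP = π for this particular matrix), X starts at x = 0, Y starts from π, and the two copies run independently until they meet at time T, after which they move together. Both P(T > n) and max_y |p_n(x, y) − π(y)| should come out small for moderate n.

```python
import random

# Sketch of the coupling in Theorem 26.3; the chain and pi are my own
# example (pi solved from pi P = pi by hand for this matrix).
random.seed(11)
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.4, 0.0, 0.6]]
pi = [28 / 73, 20 / 73, 25 / 73]

def draw(dist):
    # sample an index from a probability vector
    u, acc = random.random(), 0.0
    for y, p in enumerate(dist):
        acc += p
        if u < acc:
            return y
    return len(dist) - 1

n, trials = 30, 5000
not_coupled, hits = 0, [0, 0, 0]
for _ in range(trials):
    x, y = 0, draw(pi)                     # X_0 = 0; Y_0 distributed as pi
    coupled = x == y
    for _ in range(n):
        x = draw(P[x])
        y = x if coupled else draw(P[y])   # move together once coupled
        coupled = coupled or x == y
    hits[x] += 1
    not_coupled += 0 if coupled else 1

p_T_gt_n = not_coupled / trials            # estimates P(T > n)
tv = max(abs(hits[y] / trials - pi[y]) for y in range(3))
```

This matches the inequality |p_n(x, y) − π(y)| ≤ P(T > n) derived above.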


27. Gaussian sequences.

We first prove a converse to Proposition 17.3.

Proposition 27.1. If E e^{i(uX+vY)} = E e^{iuX} E e^{ivY} for all u and v, then X and Y are independent random variables.

Proof. Let X′ be a random variable with the same law as X, Y′ one with the same law as Y, with X′ and Y′ independent. (We let Ω = [0, 1]^2, P Lebesgue measure, X′ a function of the first variable, and Y′ a function of the second variable defined as in Proposition 1.2.) Then E e^{i(uX′+vY′)} = E e^{iuX′} E e^{ivY′}. Since X and X′ have the same law, they have the same characteristic function, and similarly for Y and Y′. Therefore (X′, Y′) has the same joint characteristic function as (X, Y). By the uniqueness of the Fourier transform, (X′, Y′) has the same joint law as (X, Y), which is easily seen to imply that X and Y are independent.


A sequence of random variables X_1, ..., X_n is said to be jointly normal if there exist a sequence of independent standard normal random variables Z_1, ..., Z_m and constants b_{ij} and a_i such that X_i = Σ_{j=1}^m b_{ij} Z_j + a_i, i = 1, ..., n. In matrix notation, X = BZ + A. For simplicity, in what follows let us take A = 0; the modifications for the general case are easy. The covariance of two random variables X and Y is defined to be E[(X − EX)(Y − EY)]. Since we are assuming our normal random variables have mean 0, we can omit the centering at expectations. Given a sequence of mean 0 random variables, we can talk about the covariance matrix, which is Cov(X) = E XX^t, where X^t denotes the transpose of the vector X. In the above case, we see Cov(X) = E[(BZ)(BZ)^t] = E[BZZ^tB^t] = BB^t, since E ZZ^t = I, the identity.

Let us compute the joint characteristic function E e^{iu^tX} of the vector X, where u is an n-dimensional vector. First, if v is an m-dimensional vector,

E e^{iv^tZ} = E Π_{j=1}^m e^{iv_j Z_j} = Π_{j=1}^m E e^{iv_j Z_j} = Π_{j=1}^m e^{−v_j^2/2} = e^{−v^t v/2},

using the independence of the Z's. So

E e^{iu^tX} = E e^{iu^tBZ} = e^{−u^t BB^t u/2}.
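The identity Cov(X) = BB^t for X = BZ is easy to confirm empirically. A minimal sketch, with a 2×2 matrix B of my own arbitrary choosing:

```python
import random

# Sketch: empirical check of Cov(X) = B B^t for X = BZ, Z standard
# normal; the matrix B is an arbitrary choice of mine.
random.seed(3)
B = [[1.0, 0.0],
     [0.5, 2.0]]
N = 100000

cov = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(N):
    Z = [random.gauss(0.0, 1.0) for _ in range(2)]
    X = [sum(B[i][j] * Z[j] for j in range(2)) for i in range(2)]
    for i in range(2):
        for j in range(2):
            cov[i][j] += X[i] * X[j] / N   # X has mean 0, so no centering

BBt = [[sum(B[i][k] * B[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
```

The empirical matrix cov agrees with BB^t up to Monte Carlo error.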
Proposition 27.2. If the X_i are jointly normal and Cov(X_i, X_j) = 0 for i ≠ j, then the X_i are independent.

Proof. If Cov(X) = BB^t is a diagonal matrix, then the joint characteristic function of the X's factors, and so by Proposition 27.1, the X's are in this case independent.


28. Stationary processes.

In this section we give some preliminaries which will be used in the next section, on the ergodic theorem. We say a sequence X_i is stationary if (X_k, X_{k+1}, ...) has the same distribution as (X_0, X_1, ...).

One example is if the X_i are i.i.d. For readers who are familiar with Markov chains, another is if X_i is a Markov chain, π is the stationary distribution, and X_0 has distribution π.

A third example is rotations of a circle. Let Ω be the unit circle, P normalized Lebesgue measure on Ω, and θ ∈ [0, 2π). We let X_0(ω) = ω and set X_n(ω) = ω + nθ (mod 2π).

A fourth example is the Bernoulli shift: let Ω = [0, 1), P Lebesgue measure, X_0(ω) = ω, and X_n(ω) the binary expansion of ω from the nth place on.


Proposition 28.1. If X_n is stationary, then Y_k = g(X_k, X_{k+1}, ...) is stationary.

Proof. If B ⊂ R^∞, let

A = {x = (x_0, x_1, ...) : (g(x_0, x_1, ...), g(x_1, x_2, ...), ...) ∈ B}.

Then

P((Y_0, Y_1, ...) ∈ B) = P((X_0, X_1, ...) ∈ A) = P((X_k, X_{k+1}, ...) ∈ A) = P((Y_k, Y_{k+1}, ...) ∈ B).


We say that T : Ω → Ω is measure preserving if P(T^{−1}A) = P(A) for all A ∈ F.

There is a one-to-one correspondence between measure preserving transformations and stationary sequences. Given T, let X_0 = ω and X_n = T^n ω. Then

P((X_k, X_{k+1}, ...) ∈ A) = P(T^k(X_0, X_1, ...) ∈ A) = P((X_0, X_1, ...) ∈ A).

On the other hand, if X_k is stationary, define Ω̂ = R^∞, and define X̂_k(ω) = ω_k, where ω = (ω_0, ω_1, ...). Define P̂ on Ω̂ so that the law of X̂ under P̂ is the same as the law of X under P. Then define Tω = (ω_1, ω_2, ...). We see that

P̂(A) = P̂((ω_0, ω_1, ...) ∈ A) = P̂((X̂_0, X̂_1, ...) ∈ A)
= P((X_0, X_1, ...) ∈ A) = P((X_1, X_2, ...) ∈ A)
= P̂((X̂_1, X̂_2, ...) ∈ A) = P̂((ω_1, ω_2, ...) ∈ A)
= P̂(Tω ∈ A) = P̂(T^{−1}A).


We say a set A is invariant if T^{−1}A = A (up to a null set, that is, the symmetric difference has probability zero). The invariant σ-field I is the collection of invariant sets. A measure preserving transformation is ergodic if the invariant σ-field is trivial.

In the case of an i.i.d. sequence, A invariant means A = T^{−n}A ∈ σ(X_n, X_{n+1}, ...) for each n. Hence each invariant set is in the tail σ-field, and by the Kolmogorov 0-1 law, T is ergodic.
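Stationarity in the Bernoulli-shift example can be checked numerically. In the sketch below (my own check, with the binary expansion truncated at 40 digits and fresh bits drawn for each sample), the variables Y_n = Σ_m 2^{−(m+1)} X_{n+m} built from i.i.d. fair bits have the same, uniform, distribution for every n:

```python
import random

# Sketch: with X_i i.i.d. fair bits, Y_n = sum_m 2^{-(m+1)} X_{n+m}
# should be uniform on [0,1) for every n. Truncation at 40 binary
# digits and per-sample bit sequences are my simplifications.
random.seed(5)

def sample_Y(n, digits=40):
    X = [random.randint(0, 1) for _ in range(n + digits)]
    return sum(X[n + m] / 2 ** (m + 1) for m in range(digits))

N = 10000
y0 = [sample_Y(0) for _ in range(N)]
y1 = [sample_Y(1) for _ in range(N)]
mean0, mean1 = sum(y0) / N, sum(y1) / N
frac0 = sum(y < 0.5 for y in y0) / N       # both should be about 1/2
frac1 = sum(y < 0.5 for y in y1) / N
```

Both samples have mean near 1/2 and put about half their mass below 1/2, as the uniform distribution requires.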
recall that if f is measurable and bounded, then f is the L^2 limit of Σ_{k=−K}^K c_k e^{ikx}, where the c_k are the Fourier coefficients. So

f(T^n x) = Σ c_k e^{ikx+iknθ} = Σ d_k e^{ikx},

where d_k = c_k e^{iknθ}. If f(T^n x) = f(x) a.e., then c_k = d_k, or c_k e^{iknθ} = c_k. But θ is not a rational multiple of π, so e^{iknθ} ≠ 1 when k ≠ 0, so c_k = 0 for k ≠ 0. Therefore f is constant a.e. If we take f = 1_A, this says that either A is empty or A is the whole space, up to sets of measure zero.


Our last example was the Bernoulli shift. Let X_i be i.i.d. with P(X_i = 1) = P(X_i = 0) = 1/2. Let Y_n = Σ_{m=0}^∞ 2^{−(m+1)} X_{n+m}. So there exists g such that Y_n = g(X_n, X_{n+1}, ...). If A is invariant for the Bernoulli shift,

A = ((Y_n, Y_{n+1}, ...) ∈ B) = ((X_n, X_{n+1}, ...) ∈ C),

where C = {x : (g(x_0, x_1, ...), g(x_1, x_2, ...), ...) ∈ B}. This is true for all n, so A is in the invariant σ-field for the X_i's, which is trivial. Therefore T is ergodic.


29. The ergodic theorem.

The key to the ergodic theorem is the following maximal lemma.

Lemma 29.1. Let X be integrable. Let T be a measure preserving transformation, let X_j(ω) = X(T^j ω), let S_k(ω) = X_0(ω) + ··· + X_{k−1}(ω), and let M_k(ω) = max(0, S_1(ω), ..., S_k(ω)). Then E[X; M_k > 0] ≥ 0.

Proof. If j ≤ k, M_k(Tω) ≥ S_j(Tω), so X(ω) + M_k(Tω) ≥ X(ω) + S_j(Tω) = S_{j+1}(ω), or

X(ω) ≥ S_{j+1}(ω) − M_k(Tω),   j = 1, ..., k.

Since S_1(ω) = X(ω) and M_k(Tω) ≥ 0, then

X(ω) ≥ S_1(ω) − M_k(Tω).

Therefore

E[X(ω); M_k > 0] ≥ ∫_{(M_k>0)} [max(S_1, ..., S_k)(ω) − M_k(Tω)] = ∫_{(M_k>0)} [M_k(ω) − M_k(Tω)].

On the set (M_k = 0) we have M_k(ω) − M_k(Tω) = −M_k(Tω) ≤ 0. Hence

E[X(ω); M_k > 0] ≥ ∫ [M_k(ω) − M_k(Tω)].

Since T is measure preserving, E M_k(ω) − E M_k(Tω) = 0, which completes the proof.


Recall I is the invariant σ-field. The ergodic theorem says the following.

Theorem 29.2. Let T be measure preserving and X integrable. Then

(1/n) Σ_{m=0}^{n−1} X(T^m ω) → E[X | I],

where the convergence takes place almost surely and in L^1.
Proof. By considering X − E[X | I], we may suppose E[X | I] = 0. Let ε > 0 and let D = (lim sup S_n/n > ε); we will show P(D) = 0.

Let δ > 0. Since X is integrable, Σ_n P(|X_n(ω)| > δn) = Σ_n P(|X| > δn) < ∞ (cf. the proof of Proposition 5.1). By Borel–Cantelli, |X_n|/n will eventually be less than δ. Since δ is arbitrary, |X_n|/n → 0 a.s. Since

(S_n/n)(Tω) − (S_n/n)(ω) = X_n(ω)/n − X_0(ω)/n → 0,

then lim sup (S_n/n)(Tω) = lim sup (S_n/n)(ω), and so D ∈ I. Let X*(ω) = (X(ω) − ε)1_D(ω), and define S*_n and M*_n analogously to the definitions of S_n and M_n. On D, lim sup (S_n/n) > ε, hence lim sup (S*_n/n) > 0.

Let F = ∪_n (M*_n > 0). Note ∪_{i=0}^n (M*_i > 0) = (M*_n > 0). Also |X*| ≤ |X| + ε is integrable. By Lemma 29.1, E[X*; M*_n > 0] ≥ 0. By dominated convergence, E[X*; F] ≥ 0.

We claim D = F, up to null sets. To see this, if lim sup (S*_n/n) > 0, then ω ∈ ∪_n (M*_n > 0). Hence D ⊂ F. On the other hand, if ω ∈ F, then M*_n > 0 for some n, so X*_n ≠ 0 for some n. By the definition of X*, for some n, T^n ω ∈ D, and since D is invariant, ω ∈ D a.s.

Recall D ∈ I. Then

0 ≤ E[X*; D] = E[X − ε; D] = E[E[X | I]; D] − εP(D) = −εP(D),

using the fact that E[X | I] = 0. We conclude P(D) = 0 as desired.

Since we have this for every ε, then lim sup S_n/n ≤ 0. By applying the same argument to −X, we obtain lim inf S_n/n ≥ 0, and we have proved the almost sure result. Let us now turn to the L^1 convergence.

Let M > 0, X′_M = X 1_{(|X| ≤ M)}, and X″_M = X − X′_M. By the almost sure result,

(1/n) Σ X′_M(T^m ω) → E[X′_M | I]

almost surely. Both sides are bounded by M, so

E |(1/n) Σ X′_M(T^m ω) − E[X′_M | I]| → 0.   (29.1)

Let ε > 0 and choose M large so that E|X″_M| < ε; this is possible by dominated convergence. We have

E |(1/n) Σ_{m=0}^{n−1} X″_M(T^m ω)| ≤ (1/n) Σ E|X″_M(T^m ω)| = E|X″_M| ≤ ε

and

E |E[X″_M | I]| ≤ E[E[|X″_M| | I]] = E|X″_M| ≤ ε.

So combining with (29.1),

lim sup E |(1/n) Σ X(T^m ω) − E[X | I]| ≤ 2ε.

This shows the L^1 convergence.


What does the ergodic theorem tell us about our examples? In the case of i.i.d. random variables, we see S_n/n → EX almost surely and in L^1, since E[X | I] = EX. Thus this gives another proof of the SLLN.

For rotations of the circle with X(ω) = 1_A(ω) and θ an irrational multiple of π, E[X | I] = EX = P(A), the normalized Lebesgue measure of A. So the ergodic theorem says that (1/n) Σ 1_A(ω + nθ), the average number of times ω + nθ is in A, converges for almost every ω to the normalized Lebesgue measure of A.
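This rotation example is easy to illustrate numerically. A sketch, with θ = 1 (which is not a rational multiple of π) and a starting point and arc A of my own choosing:

```python
import math

# Sketch: Birkhoff averages for the rotation X_n(w) = w + n*theta mod 2*pi.
# theta = 1 (not a rational multiple of pi); omega and the arc A are
# arbitrary choices of mine.
theta, omega = 1.0, 0.3
A = (0.0, math.pi / 2)                     # arc of normalized measure 1/4

n, count, x = 200000, 0, omega
for _ in range(n):
    if A[0] <= x < A[1]:
        count += 1
    x = (x + theta) % (2 * math.pi)

time_average = count / n                   # should approach 1/4
```

The time average along the orbit comes out close to 1/4, the normalized Lebesgue measure of A, as the ergodic theorem predicts.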

