
Advanced Optimization Lecture Notes - Chapter 5: Gradient Descent



Gradient Descent



Hoàng Nam Dũng



Gradient descent



Consider unconstrained, smooth convex optimization
\[
\min_x f(x)
\]
with convex and differentiable function \(f : \mathbb{R}^n \to \mathbb{R}\). Denote the optimal value by \(f^\ast = \min_x f(x)\) and a solution by \(x^\ast\).

Gradient descent: choose initial point \(x^{(0)} \in \mathbb{R}^n\), repeat:
\[
x^{(k)} = x^{(k-1)} - t_k \cdot \nabla f(x^{(k-1)}), \qquad k = 1, 2, 3, \dots
\]
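The update rule above translates directly into a short loop. A minimal sketch follows; the test function \(f(x) = x_1^2 + 2x_2^2\), its gradient, and the step size are illustrative choices, not from the slides:

```python
import numpy as np

def gradient_descent(grad_f, x0, t, num_iters=100):
    """Gradient descent with a fixed step size t."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        # x^(k) = x^(k-1) - t * grad f(x^(k-1))
        x = x - t * grad_f(x)
    return x

# Illustrative example: f(x) = x1^2 + 2*x2^2, so grad f(x) = (2*x1, 4*x2)
grad_f = lambda x: np.array([2 * x[0], 4 * x[1]])
x_star = gradient_descent(grad_f, x0=[3.0, -2.0], t=0.1, num_iters=200)
# the iterates approach the minimizer (0, 0)
```

A fixed \(t_k = t\) is the simplest schedule; the later slides discuss when it converges and how to choose it.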












Gradient descent interpretation



At each iteration, consider the expansion
\[
f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t} \|y - x\|_2^2.
\]
Quadratic approximation, replacing the usual Hessian \(\nabla^2 f(x)\) by \(\frac{1}{t} I\):

- \(f(x) + \nabla f(x)^T (y - x)\): linear approximation to \(f\)
- \(\frac{1}{2t} \|y - x\|_2^2\): proximity term to \(x\), with weight \(1/(2t)\)

Choose next point \(y = x^+\) to minimize the quadratic approximation; setting its gradient \(\nabla f(x) + \frac{1}{t}(y - x)\) to zero gives
\[
x^+ = x - t \nabla f(x).
\]
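This interpretation can be checked numerically: the gradient step should be the minimizer of the quadratic model. A small sketch using the slides' later example \(f(x) = (10x_1^2 + x_2^2)/2\) (the point \(x\), step size \(t\), and random probes are illustrative choices):

```python
import numpy as np

# f(x) = (10 x1^2 + x2^2)/2 and its gradient
f = lambda x: 5 * x[0]**2 + 0.5 * x[1]**2
grad_f = lambda x: np.array([10 * x[0], x[1]])

x = np.array([1.0, 2.0])   # illustrative current iterate
t = 0.05                   # illustrative step size
g = grad_f(x)

def model(y):
    """Quadratic model m(y) = f(x) + grad_f(x)^T (y-x) + ||y-x||^2 / (2t)."""
    d = y - x
    return f(x) + g @ d + (1.0 / (2 * t)) * (d @ d)

x_plus = x - t * g  # claimed minimizer: gradient step from x

# m is strictly convex and its gradient g + (y - x)/t vanishes at x_plus,
# so m(x_plus) <= m(y) for any other y:
rng = np.random.default_rng(0)
for _ in range(100):
    y = x_plus + 0.1 * rng.standard_normal(2)
    assert model(x_plus) <= model(y)
```

The closed-form value \(m(x^+) = f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2\) also falls out of the same computation, which is the quantity used later in convergence proofs.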




Outline



- How to choose step sizes
- Convergence analysis
- Nonconvex functions



Fixed step size



Simply take \(t_k = t\) for all \(k = 1, 2, 3, \dots\); gradient descent can diverge if \(t\) is too big.

Consider \(f(x) = (10x_1^2 + x_2^2)/2\), gradient descent after 8 steps:
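The divergence claim can be verified directly on this example. For \(f(x) = (10x_1^2 + x_2^2)/2\) the Hessian's largest eigenvalue is \(L = 10\), so any fixed \(t > 2/L = 0.2\) makes the \(x_1\) update factor \(|1 - 10t|\) exceed 1. A sketch with the illustrative choices \(t = 0.25\) and starting point \((1, 1)\):

```python
import numpy as np

# Slides' example: f(x) = (10 x1^2 + x2^2)/2, grad f(x) = (10 x1, x2)
grad_f = lambda x: np.array([10 * x[0], x[1]])

def run(t, steps, x0=(1.0, 1.0)):
    """Fixed-step gradient descent on the example function."""
    x = np.array(x0)
    for _ in range(steps):
        x = x - t * grad_f(x)
    return x

# With t = 0.25 > 0.2, the x1 coordinate is multiplied by |1 - 10t| = 1.5
# each step, so 8 steps scale it by 1.5^8 (about 25.6): divergence.
x_big = run(t=0.25, steps=8)
```

The \(x_2\) coordinate still shrinks (its factor is \(|1 - t| = 0.75\)); divergence is driven entirely by the steep direction, which is why the condition involves the largest curvature.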




Fixed step size



Can be slow if \(t\) is too small. Same example, gradient descent after 100 steps:





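The slowdown is easy to quantify on the same example by counting how many iterations each step size needs to reach a fixed accuracy. A sketch; the tolerance, starting point, and the two step sizes compared are illustrative choices:

```python
import numpy as np

# Slides' example: f(x) = (10 x1^2 + x2^2)/2, grad f(x) = (10 x1, x2)
grad_f = lambda x: np.array([10 * x[0], x[1]])

def iterations_needed(t, x0=(1.0, 1.0), tol=1e-3, max_iter=100_000):
    """Iterations of fixed-step gradient descent until ||x|| < tol."""
    x = np.array(x0)
    for k in range(1, max_iter + 1):
        x = x - t * grad_f(x)
        if np.linalg.norm(x) < tol:
            return k
    return max_iter

fast = iterations_needed(t=0.1)    # a step near the stable range
slow = iterations_needed(t=0.001)  # a needlessly small step
# slow is roughly two orders of magnitude larger than fast
```

With the small step, the shallow \(x_2\) direction shrinks only by the factor \(1 - t = 0.999\) per iteration, which is what the creeping iterates in the 100-step picture illustrate.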

