Descent Methods

CPSC 406 – Computational Optimization

Descent methods

  • descent directions
  • line search
  • convergence

Descent directions

\[ \min_x\ f(x) \textt{with} f:\Rn\to\R \quad \text{continuously differentiable} \]

  • directional derivative of \(f\) along ray \(x+\alpha d\)

\[ f'(x;d) = \lim_{\alpha\to 0^+} \frac{f(x+\alpha d) - f(x)}{\alpha} = \nabla f(x)^T d \]

  • \(d\) is a descent direction at \(x\) if

\[ f'(x;d) < 0 \]

  • by continuity, if \(d\) is a descent direction, then for some maximum step \(\bar\alpha\)

\[ f(x+\alpha d) < f(x) \quad \forall \alpha\in(0,\bar\alpha) \]

Generic descent method

Initialize: choose \(x_0\in\Rn\)

For \(k=0,1,2,\ldots\)

  • compute descent direction \(d^{(k)}\)
  • compute step size \(\alpha^{(k)}\)
  • update \(x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}\)
  • stop if OPTIMAL or MAXITER
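
A minimal Julia sketch of this loop (the function name descend and the direction/stepsize callbacks are illustrative placeholders, not part of the course code):

using LinearAlgebra: norm

# generic descent loop with user-supplied direction and stepsize callbacks
function descend(∇f, x0; direction, stepsize, tol=1e-6, maxiter=1000)
    x = copy(x0)
    for k in 0:maxiter-1
        norm(∇f(x)) < tol && break   # stop if (approximately) optimal
        d = direction(x)             # descent direction d⁽ᵏ⁾
        α = stepsize(x, d)           # step size α⁽ᵏ⁾
        x = x + α*d                  # update x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ + α⁽ᵏ⁾d⁽ᵏ⁾
    end
    return x                         # either (approximately) optimal or MAXITER reached
end

For example, the gradient method described below corresponds to direction = x -> -∇f(x).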


Questions:

  • how to determine a starting point?
  • what are advantages/disadvantages of different directions \(d^{(k)}\)?
  • how to choose step size \(\alpha^{(k)}\)?
  • reasonable stopping criteria?

Gradient descent

\[ x^{k+1} = x^{k} + \alpha^{k} d, \qquad d = -\nabla f(x^{k}) \]

  • negative gradient is a descent direction

\[ f'(x;-\nabla f(x)) = -\nabla f(x)^T \nabla f(x) = -\|\nabla f(x)\|^2 < 0 \]

  • negative gradient is the steepest descent direction of \(f\) at \(x\)

\[ -\frac{\nabla f(x)}{\|\nabla f(x)\|} = \mathop{\rm argmin}_{\|d\|\le1} f'(x;d) \quad\text{(most negative)} \]

proof proceeds via the Cauchy-Schwarz inequality: for any vectors \(w,v\in\Rn\), \[ -\|w\|\cdot\|v\| \le w\T v \le +\|w\|\cdot\|v\| \]
and the upper (respectively lower) bound is achieved if and only if \(w\) and \(v\) are parallel and point in the same (respectively opposite) direction
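
Spelling this out for the steepest-descent claim (assuming \(\nabla f(x)\neq0\)): for any \(d\) with \(\|d\|\le1\),

\[ f'(x;d) = \nabla f(x)\T d \;\ge\; -\|\nabla f(x)\|\,\|d\| \;\ge\; -\|\nabla f(x)\|, \]

and both inequalities hold with equality at \(d = -\nabla f(x)/\|\nabla f(x)\|\), so this choice minimizes the directional derivative over the unit ball.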

Gradient method

Initialize: choose \(x_0\in\Rn\) and tolerance \(\epsilon>0\)


For \(k=0,1,2,\ldots\)

  1. choose step size \(\alpha^k\) to approximately minimize \[ \phi(\alpha) = f(x^k - \alpha \nabla f(x^k)) \]

  2. update \(x^{k+1} = x^k - \alpha^k \nabla f(x^k)\)

  3. stop if \(\|\nabla f(x^k)\| < \epsilon\)
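
A sketch of this method in Julia; the stepsize callback is a placeholder for one of the rules discussed next:

using LinearAlgebra: norm

# gradient method: steps 1–3 above, with ‖∇f(x)‖ < ϵ as the stopping test
function gradientdescent(∇f, x0; stepsize, ϵ=1e-6, maxiter=1000)
    x = copy(x0)
    for k in 0:maxiter-1
        g = ∇f(x)
        norm(g) < ϵ && break       # step 3: stop
        α = stepsize(x)            # step 1: constant, exact, or backtracking rule
        x = x - α*g                # step 2: update
    end
    return x
end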

Step size selection

step size rules typically used in practice

  • exact (hard, except for quadratic \(f\))

\[\alpha^{k} \in \mathop{\rm argmin}_{\alpha\ge0}\ \phi(\alpha), \qquad \phi(\alpha):=f(x^{k} + \alpha d^{k})\]

  • constant (cheap and easy, but requires analyzing \(f\))

\[\alpha^{k} = \bar\alpha>0 \quad \forall k\]

  • approximate — backtracking linesearch, eg, Armijo (relatively cheap, no analysis required)
    • reduce \(\alpha\) until sufficient decrease in \(f\), ie, with \(\mu\in(0,1)\)

 

  1. set \(\alpha^{k} = \bar\alpha>0\)
  2. until \(f(x^{k} + \alpha^{k} d^{k}) < f(x^{k})+\mu\,\alpha^{k} f'(x^{k};d^{k})\)
    • \(\alpha^{k} \gets \alpha^{k}/2\) (or some other divisor)
  3. return \(\alpha^{k}\)

 

constant stepsize

Constant stepsize

  • need to fix \(\bar\alpha>0\) small enough to ensure convergence
  • sufficient condition: choose \(\bar\alpha\) small enough to guarantee \[ f(x^{(k)}+\bar\alpha d^{(k)}) < f(x^{(k)}) \quad \forall k \]

example — \(f(x)=\half x^2\) with \(x\in\R\). Then

\[ \begin{aligned} x^{(k+1)} &= x^{(k)} - \bar\alpha \nabla f(x^{(k)})\\ &= x^{(k)} - \bar\alpha x^{(k)}\\ &= (1-\bar\alpha)x^{(k)}\\ &= (1-\bar\alpha)^{k+1}x^{(0)} \end{aligned} \]

if \(\bar\alpha\in(0,2)\) then \(|1-\bar\alpha|<1\) and

\[f(x^{(k)})=\half (1-\bar\alpha)^{2k} (x^{(0)})^2\to0 \textt{as} k\to\infty \]
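
A quick numerical check of this example (the helper function iterate_const is illustrative):

# iterate x ← x - ᾱ∇f(x) = (1-ᾱ)x for f(x) = ½x², starting from x⁰
function iterate_const(ᾱ, x; k=20)
    for _ in 1:k
        x = (1 - ᾱ)*x        # one constant-stepsize gradient step
    end
    return 0.5*x^2           # f(x⁽ᵏ⁾)
end

iterate_const(1.5, 1.0)      # ≈ 4.5e-13: converges, since ᾱ ∈ (0,2)
iterate_const(2.5, 1.0)      # ≈ 5.5e6: diverges, since ᾱ ∉ (0,2)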

Constant stepsize — quadratic functions

\[f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0\]

  • a reliable constant stepsize \(\bar\alpha\) depends on the maximum eigenvalue of \(H\); observe that \[ d\T H d \le \lambda_{\max}(H) \|d\|^2 \quad \forall d\in\Rn \qquad(1)\]

  • behaviour of the function value along the steepest descent direction \(d=-\nabla f(x)\) \[ \begin{aligned} f(x+\alpha d) &= f(x) + \alpha d\T\nabla f(x) + \half \alpha^2 d\T \nabla^2f(x) d & \quad \text{(exact because $f$ quadratic)} \\ &\le f(x) - \alpha\|\nabla f(x)\|^2 + \half \alpha^2 \lambda_{\max}(H)\|\nabla f(x)\|^2 &\quad \text{(by (1) and $\|d\|=\|\nabla f(x)\|$)} \\ &= f(x) - \underbrace{(\alpha-\half \alpha^2 \lambda_{\max}(H))}_{(♥)}\|\nabla f(x)\|^2 \end{aligned} \]

  • if \(♥>0\) then \(f(x+\alpha d)<f(x)\), as required, so choose \[\alpha\in(0,2/\lambda_{\max}(H))\]
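
A sketch of this choice in Julia on an arbitrary small quadratic (all names are illustrative):

using LinearAlgebra

# constant-stepsize gradient method on f(x) = ½xᵀHx + bᵀx with ᾱ = 1/λmax(H)
function quadgrad(H, b; iters=500)
    ᾱ = 1.0 / eigmax(Symmetric(H))   # safely inside (0, 2/λmax(H))
    x = zeros(length(b))
    for _ in 1:iters
        x = x - ᾱ*(H*x + b)          # ∇f(x) = Hx + b
    end
    return x
end

H, b = [2.0 0.5; 0.5 1.0], [1.0, -1.0]   # arbitrary H ≻ 0
norm(H*quadgrad(H, b) + b)               # ≈ 0: the iterates approach the minimizer -H\b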

Lipschitz smooth functions

for general smooth functions, constant stepsize depends on the Lipschitz constant of the gradient

Definition 1 (L-smooth functions) The differentiable function \(f:\Rn\to\R\) is \(L\)-Lipschitz smooth if \[ \|\nabla f(x) - \nabla f(y)\| \le L\|x-y\| \quad \forall x,y\in\Rn \]

example — quadratic functions

\[ f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0 \]

  • \(f\) is \(\lambda_{\max}(H)\)-Lipschitz smooth because \[ \begin{aligned} \|\nabla f(x) - \nabla f(y)\| &= \|H(x-y)\| &\quad (=\|(Hx+b)-(Hy+b)\|) \\ &= \|\Lambda U\T (x-y)\| &\quad (H=U\Lambda U\T, \quad UU\T=I) \\ &= \|\Lambda v\| &\quad (v=U\T(x-y)) \\ &= \textstyle\sqrt{\sum_{i=1}^n \lambda_i^2 v_i^2} \\ &\le \lambda_{\max}(H)\|v\|\ \\ &= \lambda_{\max}(H)\|x-y\| &\quad (\|v\|=\|x-y\|) \end{aligned} \]
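
A short numerical check of this bound (arbitrary data):

using LinearAlgebra
H = [2.0 0.5; 0.5 1.0]                                # arbitrary H ≻ 0
x, y = randn(2), randn(2)
norm(H*(x - y)) <= eigmax(Symmetric(H))*norm(x - y)   # true: ∇f is λmax(H)-Lipschitz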

Second-order L-smooth characterization

If \(f\) is twice continuously differentiable, then \(f\) is \(L\)-Lipschitz smooth if and only if its Hessian is bounded by \(L\), ie, for all \(x\in\Rn\) \[ \nabla^2 f(x) \preceq L I \quad\Longleftrightarrow\quad L I - \nabla^2 f(x) \succeq 0 \] which implies that the quadratic model with curvature constant \(L\) is an upper bound on \(f\):
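
explicitly, this gives the familiar quadratic upper bound (the descent lemma): for all \(x,y\in\Rn\),

\[ f(y) \le f(x) + \nabla f(x)\T (y-x) + \tfrac{L}{2}\|y-x\|^2 \]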

Example — logistic loss

  • given feature/label pairs \((a_i,b_i)\in\Rn\times\{0,1\}\), \(i=1,\ldots,m\), find \(x\) to fit logistic model

\[ \sigma(a^{\intercal}_i x) \approx b_i, \quad \text{where} \quad \sigma(t) = \frac{1}{1+e^{-t}} \]

  • logistic loss problem, and objective gradient and Hessian

\[ \min_x f(x):=-\sum_{i=1}^m \Big[ b_i\log(\sigma(a_i^\intercal x)) + (1-b_i)\log(1-\sigma(a_i^\intercal x)) \Big] \]

\[ \nabla f(x) = A\T r, \quad \nabla^2 f(x) = A\T D A, \quad r = \sigma.(Ax) - b, \quad D = \Diag(\sigma_i(1-\sigma_i))_{i=1}^m, \quad \sigma_i := \sigma(a_i^\intercal x) \]

  • because the diagonal entries of \(D\) lie in \((0,\tfrac14]\), for all unit-norm \(u\),

\[ u\T \nabla^2 f(x) u = u\T(A\T D A)u \le \frac{1}{4} u\T(A\T A)u \le \frac{1}{4} \lambda_{\max}(A\T A) \]

  • so \(f\) is \(L\)-Lipschitz smooth with \(L=\lambda_{\max}(A\T A)/4\)
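
A sketch of these formulas in Julia with random data (all variable names are illustrative):

using LinearAlgebra

σ(t) = 1 / (1 + exp(-t))
A, b = randn(20, 3), rand(0:1, 20)                      # random feature/label data
f(x)    = -sum(b .* log.(σ.(A*x)) .+ (1 .- b) .* log.(1 .- σ.(A*x)))
∇f(x)   = A' * (σ.(A*x) .- b)                           # gradient Aᵀr
hess(x) = A' * Diagonal(σ.(A*x) .* (1 .- σ.(A*x))) * A  # Hessian AᵀDA
L = eigmax(Symmetric(A'A)) / 4                          # Lipschitz constant of ∇f
eigmax(Symmetric(hess(randn(3)))) <= L                  # true: ∇²f(x) ⪯ L·I at any x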

exact linesearch

Exact linesearch

  • exact linesearch is typically possible in closed form only for quadratic functions

\[ f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0 \]

  • exact linesearch solves the 1-dimensional optimization problem with \(d\) descent dir: \[ \min_{\alpha\ge0} \ \phi(\alpha) := f(x + \alpha d) \]

  • exact step computation:

\[ \begin{aligned} \phi(\alpha) &= \half (x + \alpha d)\T H(x + \alpha d) + b\T (x + \alpha d) + \gamma\\ \\ \phi'(\alpha) &= \alpha d\T H d + x\T H d + b\T d = \alpha d\T H d + \nabla f(x)\T d\\ \\ \phi'(\alpha^*) &= 0 \quad \Longleftrightarrow \quad \alpha^* = -\frac{\nabla f(x)\T d}{d\T H d} \end{aligned} \]
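
Note that \(\alpha^*>0\) because \(\nabla f(x)\T d<0\) (descent direction) and \(d\T Hd>0\) (\(H\succ0\)). A sketch of the computation in Julia (illustrative names and data):

using LinearAlgebra

# exact stepsize along a descent direction d for f(x) = ½xᵀHx + bᵀx + γ
exactstep(H, b, x, d) = -dot(H*x + b, d) / dot(d, H*d)

H, b = [2.0 0.5; 0.5 1.0], [1.0, -1.0]   # arbitrary H ≻ 0
x = [1.0, 1.0]
d = -(H*x + b)                           # steepest-descent direction
α = exactstep(H, b, x, d)
dot(H*(x + α*d) + b, d)                  # ≈ 0: φ′(α*) = 0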

backtracking

Backtracking linesearch (Armijo)

pull back along descent direction \(d^{k}\) until sufficient decrease in \(f\)

  • \(f'(x^k;d^k) < 0\)
  • sufficient descent parameter \(\mu\in(0,1)\)




using LinearAlgebra: dot

function armijo(f, ∇f, x, d; μ=1e-4, α=1, ρ=0.5, maxits=10)
    fx, slope = f(x), dot(∇f(x), d)      # evaluate once; slope = f'(x; d) < 0
    for _ in 1:maxits
        if f(x + α*d) < fx + μ*α*slope   # sufficient decrease (Armijo) condition
            return α
        end
        α *= ρ                           # backtrack: shrink the stepsize
    end
    error("backtracking linesearch failed")
end;
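
For example, on a simple quadratic (an illustrative test problem, not from the notes):

fq(x)  = 0.5*dot(x, x)          # dot comes from the LinearAlgebra import above
∇fq(x) = x
x0 = [4.0, -3.0]
d  = -∇fq(x0)                   # steepest-descent direction
armijo(fq, ∇fq, x0, d)          # returns 1: the full step already decreases f enough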

Convergence of gradient method

suppose \(f:\Rn\to\R\) is \(L\)-smooth and consider the gradient iteration \[ x^{k+1} = x^k - \alpha^k \nabla f(x^k) \]

with

  • constant stepsize \(\alpha^k = \bar\alpha\in(0,2/L)\)
  • exact stepsize \(\alpha^k=\argmin_{\alpha\ge0} f(x^k+\alpha d^k)\)
  • backtracking stepsize \(\alpha^k\) with \(\mu\in(0,1)\)

guarantee – for all \(k=0,1,2,\ldots\)

  • descent (unless \(\nabla f(x^k)=0\)) \[f(x^{k+1}) < f(x^k)\]

  • convergence \[\|\nabla f(x^k)\| \to 0\]
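
A quick empirical check of these guarantees on a small quadratic with a constant stepsize in \((0,2/L)\) (all names illustrative):

using LinearAlgebra

# run the gradient method; verify descent at every step and a shrinking gradient
function check_descent(H, b, x; iters=25)
    f(x)  = 0.5*dot(x, H*x) + dot(b, x)
    ∇f(x) = H*x + b
    ᾱ = 1.0 / eigmax(Symmetric(H))       # constant stepsize ᾱ = 1/L ∈ (0, 2/L)
    fvals = [f(x)]
    for _ in 1:iters
        x = x - ᾱ*∇f(x)
        push!(fvals, f(x))
    end
    return all(diff(fvals) .< 0), norm(∇f(x))
end

check_descent([3.0 1.0; 1.0 2.0], [1.0, -2.0], [5.0, 5.0])   # (true, small): monotone descent, ‖∇f(xᵏ)‖ → 0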