Descent Methods

CPSC 406 – Computational Optimization

Descent methods

  • descent directions
  • line search
  • convergence

Descent directions

\[ \min_x\ f(x) \textt{with} f:\Rn\to\R \quad \text{continuously differentiable} \]

  • directional derivative of \(f\) along ray \(x+\alpha d\)

\[ f'(x;d) = \lim_{\alpha\to 0^+} \frac{f(x+\alpha d) - f(x)}{\alpha} = \nabla f(x)^T d \]

  • \(d\) is a descent direction at \(x\) if

\[ f'(x;d) < 0 \]

  • by continuity, if \(d\) is a descent direction, then for some maximum step \(\bar\alpha\)

\[ f(x+\alpha d) < f(x) \quad \forall \alpha\in(0,\bar\alpha) \]

Generic descent method

Initialize: choose \(x_0\in\Rn\)

For \(k=0,1,2,\ldots\)

  • compute descent direction \(d^{(k)}\)
  • compute step size \(\alpha^{(k)}\)
  • update \(x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}\)
  • stop if OPTIMAL or MAXITER
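
A minimal Julia sketch of this loop (the function name descend and the direction/stepsize callbacks are illustrative placeholders, not part of the course code):

using LinearAlgebra: norm

# generic descent loop with user-supplied direction and stepsize callbacks
function descend(∇f, x0; direction, stepsize, tol=1e-6, maxiter=1000)
    x = copy(x0)
    for k in 0:maxiter-1
        norm(∇f(x)) < tol && break   # stop if (approximately) optimal
        d = direction(x)             # descent direction d⁽ᵏ⁾
        α = stepsize(x, d)           # step size α⁽ᵏ⁾
        x = x + α*d                  # update x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾ + α⁽ᵏ⁾d⁽ᵏ⁾
    end
    return x                         # either (approximately) optimal or MAXITER reached
end

For example, the gradient method described below corresponds to direction = x -> -∇f(x).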


Questions:

  • how to determine a starting point?
  • what are advantages/disadvantages of different directions \(d^{(k)}\)?
  • how to choose step size \(\alpha^{(k)}\)?
  • reasonable stopping criteria?

Gradient descent

\[ x^{k+1} = x^{k} + \alpha^{k} d, \qquad d = -\nabla f(x^{k}) \]

  • negative gradient is a descent direction

\[ f'(x;-\nabla f(x)) = -\nabla f(x)^T \nabla f(x) = -\|\nabla f(x)\|^2 < 0 \]

  • negative gradient is the steepest descent direction of \(f\) at \(x\)

\[ -\frac{\nabla f(x)}{\|\nabla f(x)\|} = \mathop{\rm argmin}_{\|d\|\le1} f'(x;d) \quad\text{(most negative)} \]

proof proceeds via the Cauchy-Schwarz inequality: for any vectors \(w,v\in\Rn\), \[ -\|w\|\cdot\|v\| \le w\T v \le +\|w\|\cdot\|v\| \]
and the upper (respectively lower) bound is achieved if and only if \(w\) and \(v\) are parallel and point in the same (respectively opposite) direction
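
Spelling this out for the steepest-descent claim (assuming \(\nabla f(x)\neq0\)): for any \(d\) with \(\|d\|\le1\),

\[ f'(x;d) = \nabla f(x)\T d \;\ge\; -\|\nabla f(x)\|\,\|d\| \;\ge\; -\|\nabla f(x)\|, \]

and both inequalities hold with equality at \(d = -\nabla f(x)/\|\nabla f(x)\|\), so this choice minimizes the directional derivative over the unit ball.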

Gradient method

Initialize: choose \(x_0\in\Rn\) and tolerance \(\epsilon>0\)


For \(k=0,1,2,\ldots\)

  1. choose step size \(\alpha^k\) to approximately minimize \[ \phi(\alpha) = f(x^k - \alpha \nabla f(x^k)) \]

  2. update \(x^{k+1} = x^k - \alpha^k \nabla f(x^k)\)

  3. stop if \(\|\nabla f(x^k)\| < \epsilon\)
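
A sketch of this method in Julia; the stepsize callback is a placeholder for one of the rules discussed next:

using LinearAlgebra: norm

# gradient method: steps 1–3 above, with ‖∇f(x)‖ < ϵ as the stopping test
function gradientdescent(∇f, x0; stepsize, ϵ=1e-6, maxiter=1000)
    x = copy(x0)
    for k in 0:maxiter-1
        g = ∇f(x)
        norm(g) < ϵ && break       # step 3: stop
        α = stepsize(x)            # step 1: constant, exact, or backtracking rule
        x = x - α*g                # step 2: update
    end
    return x
end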

Step size selection

step size rules typically used in practice

  • exact (hard, except for quadratic \(f\))

\[\alpha^{k} \in \mathop{\rm argmin}_{\alpha\ge0}\ \phi(\alpha), \qquad \phi(\alpha):=f(x^{k} + \alpha d^{k})\]

  • constant (cheap and easy, but requires analyzing \(f\))

\[\alpha^{k} = \bar\alpha>0 \quad \forall k\]

  • approximate — backtracking linesearch, eg, Armijo (relatively cheap, no analysis required)
    • reduce \(\alpha\) until sufficient decrease in \(f\), ie, with \(\mu\in(0,1)\)

 

  1. set \(\alpha^{k} = \bar\alpha>0\)
  2. until \(f(x^{k} + \alpha^{k} d^{k}) < f(x^{k})+\mu\,\alpha^{k} f'(x^{k};d^{k})\)
    • \(\alpha^{k} \gets \alpha^{k}/2\) (or some other divisor)
  3. return \(\alpha^{k}\)

 

constant stepsize

Constant stepsize

  • need to fix \(\bar\alpha>0\) small enough to ensure convergence
  • sufficient condition: choose \(\bar\alpha\) small enough to guarantee \[ f(x^{(k)}+\bar\alpha d^{(k)}) < f(x^{(k)}) \quad \forall k \]

example — \(f(x)=\half x^2\) with \(x\in\R\). Then

\[ \begin{aligned} x^{(k+1)} &= x^{(k)} - \bar\alpha \nabla f(x^{(k)})\\ &= x^{(k)} - \bar\alpha x^{(k)}\\ &= (1-\bar\alpha)x^{(k)}\\ &= (1-\bar\alpha)^{k+1}x^{(0)} \end{aligned} \]

if \(\bar\alpha\in(0,2)\) then \(|1-\bar\alpha|<1\) and

\[f(x^{(k)})=\half (1-\bar\alpha)^{2k} (x^{(0)})^2\to0 \textt{as} k\to\infty \]
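
A quick numerical check of this example (the helper function iterate_const is illustrative):

# iterate x ← x - ᾱ∇f(x) = (1-ᾱ)x for f(x) = ½x², starting from x⁰
function iterate_const(ᾱ, x; k=20)
    for _ in 1:k
        x = (1 - ᾱ)*x        # one constant-stepsize gradient step
    end
    return 0.5*x^2           # f(x⁽ᵏ⁾)
end

iterate_const(1.5, 1.0)      # ≈ 4.5e-13: converges, since ᾱ ∈ (0,2)
iterate_const(2.5, 1.0)      # ≈ 5.5e6: diverges, since ᾱ ∉ (0,2)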

Constant stepsize — quadratic functions

\[f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0\]

  • a reliable constant stepsize \(\bar\alpha\) depends on the maximum eigenvalue of \(H\); observe that \[ d\T H d \le \lambda_{\max}(H) \|d\|^2 \quad \forall d\in\Rn \qquad(1)\]

  • behaviour of the function value along the steepest descent direction \(d=-\nabla f(x)\) \[ \begin{aligned} f(x+\alpha d) &= f(x) + \alpha d\T\nabla f(x) + \half \alpha^2 d\T \nabla^2f(x) d & \quad \text{(exact because $f$ quadratic)} \\ &\le f(x) - \alpha\|\nabla f(x)\|^2 + \half \alpha^2 \lambda_{\max}(H)\|\nabla f(x)\|^2 &\quad \text{(by (1) and $\|d\|=\|\nabla f(x)\|$)} \\ &= f(x) - \underbrace{(\alpha-\half \alpha^2 \lambda_{\max}(H))}_{(♥)}\|\nabla f(x)\|^2 \end{aligned} \]

  • if \(♥>0\) then \(f(x+\alpha d)<f(x)\), as required, so choose \[\alpha\in(0,2/\lambda_{\max}(H))\]
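
A sketch of this choice in Julia on an arbitrary small quadratic (all names are illustrative):

using LinearAlgebra

# constant-stepsize gradient method on f(x) = ½xᵀHx + bᵀx with ᾱ = 1/λmax(H)
function quadgrad(H, b; iters=500)
    ᾱ = 1.0 / eigmax(Symmetric(H))   # safely inside (0, 2/λmax(H))
    x = zeros(length(b))
    for _ in 1:iters
        x = x - ᾱ*(H*x + b)          # ∇f(x) = Hx + b
    end
    return x
end

H, b = [2.0 0.5; 0.5 1.0], [1.0, -1.0]   # arbitrary H ≻ 0
norm(H*quadgrad(H, b) + b)               # ≈ 0: the iterates approach the minimizer -H\b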

Lipschitz smooth functions

for general smooth functions, constant stepsize depends on the Lipschitz constant of the gradient

Definition 1 (L-smooth functions) The differentiable function \(f:\Rn\to\R\) is \(L\)-Lipschitz smooth if \[ \|\nabla f(x) - \nabla f(y)\| \le L\|x-y\| \quad \forall x,y\in\Rn \]

example — quadratic functions

\[ f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0 \]

  • \(f\) is \(\lambda_{\max}(H)\)-Lipschitz smooth because \[ \begin{aligned} \|\nabla f(x) - \nabla f(y)\| &= \|H(x-y)\| &\quad (=\|(Hx+b)-(Hy+b)\|) \\ &= \|\Lambda U\T (x-y)\| &\quad (H=U\Lambda U\T, \quad UU\T=I) \\ &= \|\Lambda v\| &\quad (v=U\T(x-y)) \\ &= \textstyle\sqrt{\sum_{i=1}^n \lambda_i^2 v_i^2} \\ &\le \lambda_{\max}(H)\|v\|\ \\ &= \lambda_{\max}(H)\|x-y\| &\quad (\|v\|=\|x-y\|) \end{aligned} \]
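
A short numerical check of this bound (arbitrary data):

using LinearAlgebra
H = [2.0 0.5; 0.5 1.0]                                # arbitrary H ≻ 0
x, y = randn(2), randn(2)
norm(H*(x - y)) <= eigmax(Symmetric(H))*norm(x - y)   # true: ∇f is λmax(H)-Lipschitz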

Second-order L-smooth characterization

If \(f\) is twice continuously differentiable, then \(f\) is \(L\)-Lipschitz smooth if and only if its Hessian is bounded by \(L\), ie, for all \(x\in\Rn\) \[ \nabla^2 f(x) \preceq L I \quad\Longleftrightarrow\quad L I - \nabla^2 f(x) \succeq 0 \] which implies that the quadratic model with curvature constant \(L\) is an upper bound on \(f\):
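
explicitly, this gives the familiar quadratic upper bound (the descent lemma): for all \(x,y\in\Rn\),

\[ f(y) \le f(x) + \nabla f(x)\T (y-x) + \tfrac{L}{2}\|y-x\|^2 \]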

Example — logistic loss

  • given feature/label pairs \((a_i,b_i)\in\Rn\times\{0,1\}\), \(i=1,\ldots,m\), find \(x\) to fit logistic model

\[ \sigma(a^{\intercal}_i x) \approx b_i, \quad \text{where} \quad \sigma(t) = \frac{1}{1+e^{-t}} \]

  • logistic loss problem, and objective gradient and Hessian

\[ \min_x f(x):=-\sum_{i=1}^m \Big[ b_i\log(\sigma(a_i^\intercal x)) + (1-b_i)\log(1-\sigma(a_i^\intercal x)) \Big] \]

\[ \nabla f(x) = A\T r, \quad \nabla^2 f(x) = A\T D A, \quad r = \sigma.(Ax) - b, \quad D = \Diag(\sigma_i(1-\sigma_i))_{i=1}^m, \quad \sigma_i := \sigma(a_i^\intercal x) \]

  • because the diagonal entries of \(D\) lie in \((0,\tfrac14]\), for all unit-norm \(u\),

\[ u\T \nabla^2 f(x) u = u\T(A\T D A)u \le \frac{1}{4} u\T(A\T A)u \le \frac{1}{4} \lambda_{\max}(A\T A) \]

  • so \(f\) is \(L\)-Lipschitz smooth with \(L=\lambda_{\max}(A\T A)/4\)
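
A sketch of these formulas in Julia with random data (all variable names are illustrative):

using LinearAlgebra

σ(t) = 1 / (1 + exp(-t))
A, b = randn(20, 3), rand(0:1, 20)                      # random feature/label data
f(x)    = -sum(b .* log.(σ.(A*x)) .+ (1 .- b) .* log.(1 .- σ.(A*x)))
∇f(x)   = A' * (σ.(A*x) .- b)                           # gradient Aᵀr
hess(x) = A' * Diagonal(σ.(A*x) .* (1 .- σ.(A*x))) * A  # Hessian AᵀDA
L = eigmax(Symmetric(A'A)) / 4                          # Lipschitz constant of ∇f
eigmax(Symmetric(hess(randn(3)))) <= L                  # true: ∇²f(x) ⪯ L·I at any x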

exact linesearch

Exact linesearch

  • exact linesearch is typically possible in closed form only for quadratic functions

\[ f(x) = \half x\T Hx + b\T x + \gamma, \textt{with} H\succ0 \]

  • exact linesearch solves the 1-dimensional optimization problem with \(d\) descent dir: \[ \min_{\alpha\ge0} \ \phi(\alpha) := f(x + \alpha d) \]

  • exact step computation:

\[ \begin{aligned} \phi(\alpha) &= \half (x + \alpha d)\T H(x + \alpha d) + b\T (x + \alpha d) + \gamma\\ \\ \phi'(\alpha) &= \alpha d\T H d + x\T H d + b\T d = \alpha d\T H d + \nabla f(x)\T d\\ \\ \phi'(\alpha^*) &= 0 \quad \Longleftrightarrow \quad \alpha^* = -\frac{\nabla f(x)\T d}{d\T H d} \end{aligned} \]
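
Note that \(\alpha^*>0\) because \(\nabla f(x)\T d<0\) (descent direction) and \(d\T Hd>0\) (\(H\succ0\)). A sketch of the computation in Julia (illustrative names and data):

using LinearAlgebra

# exact stepsize along a descent direction d for f(x) = ½xᵀHx + bᵀx + γ
exactstep(H, b, x, d) = -dot(H*x + b, d) / dot(d, H*d)

H, b = [2.0 0.5; 0.5 1.0], [1.0, -1.0]   # arbitrary H ≻ 0
x = [1.0, 1.0]
d = -(H*x + b)                           # steepest-descent direction
α = exactstep(H, b, x, d)
dot(H*(x + α*d) + b, d)                  # ≈ 0: φ′(α*) = 0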

backtracking

Backtracking linesearch (Armijo)

pull back along descent direction \(d^{k}\) until sufficient decrease in \(f\)

  • \(f'(x^k;d^k) < 0\)
  • sufficient descent parameter \(\mu\in(0,1)\)




using LinearAlgebra: dot

function armijo(f, ∇f, x, d; μ=1e-4, α=1, ρ=0.5, maxits=10)
    fx, slope = f(x), dot(∇f(x), d)      # evaluate once; slope = f'(x; d) < 0
    for _ in 1:maxits
        if f(x + α*d) < fx + μ*α*slope   # sufficient decrease (Armijo) condition
            return α
        end
        α *= ρ                           # backtrack: shrink the stepsize
    end
    error("backtracking linesearch failed")
end;
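
For example, on a simple quadratic (an illustrative test problem, not from the notes):

fq(x)  = 0.5*dot(x, x)          # dot comes from the LinearAlgebra import above
∇fq(x) = x
x0 = [4.0, -3.0]
d  = -∇fq(x0)                   # steepest-descent direction
armijo(fq, ∇fq, x0, d)          # returns 1: the full step already decreases f enough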

Convergence of gradient method

suppose \(f:\Rn\to\R\) is \(L\)-smooth and consider the gradient iteration \[ x^{k+1} = x^k - \alpha^k \nabla f(x^k) \]

with

  • constant stepsize \(\alpha^k = \bar\alpha\in(0,2/L)\)
  • exact stepsize \(\alpha^k=\argmin_{\alpha\ge0} f(x^k+\alpha d^k)\)
  • backtracking stepsize \(\alpha^k\) with \(\mu\in(0,1)\)

guarantee – for all \(k=0,1,2,\ldots\)

  • descent (unless \(\nabla f(x^k)=0\)) \[f(x^{k+1}) < f(x^k)\]

  • convergence \[\|\nabla f(x^k)\| \to 0\]
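
A quick empirical check of these guarantees on a small quadratic with a constant stepsize in \((0,2/L)\) (all names illustrative):

using LinearAlgebra

# run the gradient method; verify descent at every step and a shrinking gradient
function check_descent(H, b, x; iters=25)
    f(x)  = 0.5*dot(x, H*x) + dot(b, x)
    ∇f(x) = H*x + b
    ᾱ = 1.0 / eigmax(Symmetric(H))       # constant stepsize ᾱ = 1/L ∈ (0, 2/L)
    fvals = [f(x)]
    for _ in 1:iters
        x = x - ᾱ*∇f(x)
        push!(fvals, f(x))
    end
    return all(diff(fvals) .< 0), norm(∇f(x))
end

check_descent([3.0 1.0; 1.0 2.0], [1.0, -2.0], [5.0, 5.0])   # (true, small): monotone descent, ‖∇f(xᵏ)‖ → 0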